<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" version="2.0">

<channel>
	<title>Cs Notes</title>
	
	<link>http://csliu.com</link>
	<description>RTW: Read -&gt; Think -&gt; Write</description>
	<lastBuildDate>Tue, 03 Jan 2012 16:58:17 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/cs6notes" /><feedburner:info xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" uri="cs6notes" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item>
		<title>On User Credentials for Web Site</title>
		<link>http://csliu.com/2012/01/on-user-credentials-for-web-site/</link>
		<comments>http://csliu.com/2012/01/on-user-credentials-for-web-site/#comments</comments>
		<pubDate>Sun, 01 Jan 2012 08:30:38 +0000</pubDate>
		<dc:creator>csliu</dc:creator>
				<category><![CDATA[WebArch]]></category>

		<guid isPermaLink="false">http://blog.csliu.com/?p=478</guid>
		<description><![CDATA[<p>At the end of 2011, several serious password leaks hit leading Chinese internet companies, including <a href="http://www.csdn.net">CSDN</a> (a leading technology community), <a href="http://www.tianya.cn">Tianya</a> (a leading discussion community) and <a href="http://www.renren.com">RenRen</a> (a leading social network). These leaks had a big impact on many Chinese internet users&#8217; daily web life. So as a technical [...]]]></description>
			<content:encoded><![CDATA[<p>At the end of 2011, several serious password leaks hit leading Chinese internet companies, including <a href="http://www.csdn.net">CSDN</a> (a leading technology community), <a href="http://www.tianya.cn">Tianya</a> (a leading discussion community) and <a href="http://www.renren.com">RenRen</a> (a leading social network). These leaks had a big impact on many Chinese internet users&#8217; daily web life. So, as a technical guy, I did some investigation and summarized what I found here: how I would avoid such disasters if I were the product owner, and how to make my own Internet accounts more secure.</p>
<h3><strong>Part I &#8211; Technical Background</strong></h3>
<p><strong>Plain Text vs. Hashed Text</strong><br />
- Storing passwords as plain text is dangerous if user data ever leaks, yet it seems that almost all popular web sites do store them that way. At least that&#8217;s true in China.<br />
- Hashing transforms the plain text into a string that is meaningless to people and practically impossible to convert back to the original text. For storing user passwords, it&#8217;s more secure than plain text.<br />
- Typical hash algorithms are: MD5, SHA1, SHA256, SHA512, SHA-3</p>
<p><strong>Attacking Hash</strong><br />
- With an ideal hash algorithm, it&#8217;s impossible to convert hashed text back to the original text directly, but attackers can still recover it with dictionary or brute-force based approaches<br />
- Dictionary: the attacker precomputes the hash values of popular passwords with a specific hash algorithm and compares them with the hashed text<br />
- Brute-force: enumerate all possible passwords, hash each one and compare the result with the hashed text</p>
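<p>The two attacks above can be sketched in a few lines of Python. This is only an illustration: it assumes the leaked hashes are plain, unsalted MD5, and the word list, alphabet and function names are made up for the example.</p>

```python
import hashlib
import itertools
import string

def md5_hex(text):
    """Return the hex MD5 digest of a UTF-8 string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def dictionary_attack(target_hash, candidates):
    """Precompute the hash of every candidate, then look the target up."""
    table = {md5_hex(word): word for word in candidates}
    return table.get(target_hash)

def brute_force(target_hash, alphabet, max_len):
    """Try every string over `alphabet` with up to `max_len` characters."""
    for length in range(1, max_len + 1):
        for chars in itertools.product(alphabet, repeat=length):
            guess = "".join(chars)
            if md5_hex(guess) == target_hash:
                return guess
    return None

# Hypothetical leaked hash and word list, for illustration only.
leaked = md5_hex("abc")
print(dictionary_attack(leaked, ["123456", "password", "abc"]))
print(brute_force(leaked, string.ascii_lowercase, 3))
```

<p>The dictionary lookup is instant once the table exists, while the brute-force loop grows exponentially with the password length, which is why short passwords fall first.</p>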
<p><strong>Defending Against Hash Attacks</strong><br />
- Defending against dictionary based attacks<br />
* Use multiple hash functions together: there are only a few popular hash algorithms, so pre-computing and storing a dictionary of popular passwords is cheap. But if you chain multiple hash functions in some order, the attacker needs a separate dictionary for every possible combination, which becomes impractical due to the huge potential result space. Alternatively, you can hash the plain password text multiple times using the same algorithm.<br />
* Write your own hash function, so the attacker can&#8217;t do the pre-computation.<br />
* Add salt to the plain user password before hashing it.<br />
- Defending against brute-force based attacks<br />
* Adopt a heavy (deliberately slow) hash function, for example, the bcrypt algorithm.<br />
* Write your own hash algorithm.</p>
<p><strong>Rainbow Table</strong><br />
- It&#8217;s a variant of the naive dictionary based attack that reduces the space needed to store the precomputed dictionary, at the cost of more CPU time during precomputation and lookup.<br />
- It&#8217;s based on the idea of a hash chain: chain a series of texts by alternating hashing and reduction, and store just the head and tail; the intermediate texts can be recomputed during lookup.<br />
- The rainbow table further reduces the hash chain&#8217;s collision problem by using a different reduction function at each position in the chain.<br />
- A detailed description can be found at: <a href="http://en.wikipedia.org/wiki/Rainbow_table">wikipedia on Rainbow Table</a>.</p>
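<p>To make the chain idea concrete, here is a toy sketch. Everything in it is an assumption chosen for illustration: the password space is limited to 4-digit PINs, the hash is unsalted MD5, the chain length is 50 and the reduction function simply adds the chain position to the digest. Real rainbow tables are far larger and tuned very differently.</p>

```python
import hashlib

SPACE = 10000      # toy password space: PINs "0000" .. "9999"
CHAIN_LEN = 50     # number of hash/reduce steps per chain

def md5_hex(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def reduce_to_pin(digest, position):
    """Map a digest back into the PIN space; differs at each chain position."""
    return "%04d" % ((int(digest, 16) + position) % SPACE)

def build_table(heads):
    """Walk each chain to its end and keep only tail to head."""
    table = {}
    for head in heads:
        pin = head
        for pos in range(CHAIN_LEN):
            pin = reduce_to_pin(md5_hex(pin), pos)
        table[pin] = head
    return table

def lookup(table, target):
    """Find a PIN whose hash is `target`, or None if no stored chain covers it."""
    for pos in range(CHAIN_LEN - 1, -1, -1):
        # Assume the target sits at `pos`, then walk to the end of the chain.
        pin = reduce_to_pin(target, pos)
        for later in range(pos + 1, CHAIN_LEN):
            pin = reduce_to_pin(md5_hex(pin), later)
        head = table.get(pin)
        if head is None:
            continue
        # Recompute the chain from its head to recover the matching PIN.
        candidate = head
        for step in range(CHAIN_LEN):
            if md5_hex(candidate) == target:
                return candidate
            candidate = reduce_to_pin(md5_hex(candidate), step)
    return None

table = build_table(["0000", "1234", "4321"])
print(lookup(table, md5_hex("1234")))  # the PIN if a chain covers it, else None
```

<p>Only three head/tail pairs are stored here, yet each chain can answer for up to 50 PINs, which is the space-for-CPU trade described above.</p>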
<p><strong>Salt for Hashing</strong><br />
- Essentially, it&#8217;s a simple trick that keeps simple/popular passwords from producing well-known hash values: add some extra value to the original plain text before hashing it.<br />
- In fact, adding salt during hashing is a form of multiple hashing.<br />
- Salt can be static (a fixed value) or dynamic (generated from the plain password text).</p>
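<p>A minimal sketch of salted hashing with Python&#8217;s standard library follows. Note the assumptions: it uses a random per-user salt (a third option besides the static and password-derived salts above) and PBKDF2-HMAC-SHA256 as the slow, repeated hash, standing in for bcrypt, which is not in the standard library. The iteration count and function names are illustrative.</p>

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor; tune it for your own hardware

def hash_password(password, salt=None):
    """Return (salt, digest) to store; draws a fresh random salt if none is given."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest

def verify_password(password, salt, expected):
    """Recompute the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)

salt, stored = hash_password("tqbfjotlb")
print(verify_password("tqbfjotlb", salt, stored))
print(verify_password("123456", salt, stored))
```

<p>Because the salt differs per user, two users with the same password get different stored digests, so one precomputed dictionary no longer covers the whole user table.</p>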
<h3><strong>Part II &#8211; End User&#8217;s Perspective</strong></h3>
<p>Given the background above, how can you make your passwords more secure as an end user?<br />
- Avoid short passwords<br />
Short passwords are easy to attack with either dictionary or brute-force based approaches<br />
- Avoid simple/popular passwords; some popular passwords are listed in the reference section<br />
Dictionary based attacks can crack simple or popular passwords efficiently. This is why some websites require your password to contain non-alphabetical characters<br />
- Use a different password for each web site<br />
Otherwise, one weak web site may expose all your online assets to attackers. To manage this large number of passwords, you may consider defining some rules for them. For example:<br />
* define a password base: tqbfjotlb (from: the quick brown fox jumps over the lazy dog)<br />
* define a rule to vary the base for each site: gmailtqbfjotlb for gmail, csdntqbfjotlb for CSDN<br />
- Change your passwords often<br />
Change the base and the rule above from time to time<br />
- Adopt password management software<br />
If it&#8217;s hard for you to track many passwords for different web sites, you can use popular password management software such as <a href="http://keepass.info/">keepass</a></p>
<h3><strong>Part III &#8211; Developer&#8217;s Perspective</strong></h3>
<p>Here I summarize some tips for developing password related features.</p>
<p><strong>1. Writing your own hash function</strong></p>
<p>It&#8217;s very challenging (if not impossible) to write a cryptographic hash function that meets the &#8220;ideal&#8221; criteria:<br />
- no two different inputs have the same hash value<br />
- infeasible to recover the input from the hash value</p>
<p>That&#8217;s probably one reason why there are very few cryptographic hash algorithms. But you can write a sub-ideal algorithm (your own version, not known to others) based on a near-ideal one, such as MD5 or SHA1. One simple way is to write another hash function H and apply it to the input before hashing the result with MD5. You can give up the first criterion but must ensure the second one. To ensure the second one, you can use a lossy conversion, for example, dropping the middle letter of the input text. Since some information is dropped during the conversion, it&#8217;s infeasible to completely recover the original input.</p>
<p><strong>2. Enforce strict password rules</strong></p>
<p>To keep users from choosing popular and simple passwords, web site developers may consider enforcing some restrictions on valid passwords:<br />
- Enable a blacklist filter that forbids popular passwords.<br />
- Check the password length and forbid short passwords.<br />
- Reject simple text: a password should contain lower case and upper case letters, numbers and other types of characters.<br />
- A password should not contain user name information.<br />
- A password should not equal any previous password in the user&#8217;s history</p>
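<p>The rules above could be checked with something like the following sketch. The blacklist entries, the minimum length of 8 and the function name are assumptions for illustration; a real site would load a much larger blacklist and compare against hashes of earlier passwords rather than plain text.</p>

```python
import string

BLACKLIST = {"123456", "password", "qwerty"}  # hypothetical; use a real list
MIN_LENGTH = 8                                # assumed threshold

def password_problems(password, username, previous=()):
    """Return a list of rule violations; an empty list means the password passes."""
    problems = []
    if password.lower() in BLACKLIST:
        problems.append("too common")
    if MIN_LENGTH > len(password):
        problems.append("too short")
    classes = [
        string.ascii_lowercase,
        string.ascii_uppercase,
        string.digits,
        string.punctuation,
    ]
    if not all(any(ch in cls for ch in password) for cls in classes):
        problems.append("needs lower case, upper case, digit and symbol")
    if username and username.lower() in password.lower():
        problems.append("contains user name")
    if password in previous:
        problems.append("reuses an earlier password")
    return problems

print(password_problems("password", "alice"))
print(password_problems("Tq8#fjotlb", "alice"))
```

<p>Returning every violation at once, instead of failing on the first, lets the sign-up form tell the user everything to fix in one round trip.</p>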
<p><strong>3. On hashing algorithm</strong></p>
<p>To avoid exposing the actual hashing algorithm, you can consider:<br />
- Don&#8217;t adopt a well known algorithm naively (as-is)<br />
- Combine multiple algorithms together<br />
- Combine a well known algorithm with your own hashing function<br />
- Hash with salt</p>
<p><strong>4. Secure your transport channel</strong></p>
<p>You always need some transport channel to send the user provided name and password to your server. Securing this channel is also critical. To this end:<br />
- Prefer https over http<br />
- Consider encrypting the credentials on the client side (for example, in JavaScript) before transferring them to the server</p>
<p><strong>5. Defend against online cracking</strong></p>
<p>- Adopt CAPTCHA<br />
A typical attacker uses computer programs rather than real humans to try to log in to a web site. To tell whether the user logging in is a program or a real human being, you can adopt CAPTCHA in your online system.<br />
To avoid degrading the user experience, you can trigger the CAPTCHA only when the behavior looks suspicious.</p>
<p>- Adopt multiple channel verification<br />
If the current user shows suspicious behavior, such as too many incorrect inputs, an unusual location or interacting too fast, multiple channel verification can be triggered:<br />
* the user has to provide a security code sent to their mobile phone or email.<br />
* the user needs to wait for some time.<br />
* the user needs to pass a CAPTCHA test.</p>
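<p>One way to decide when a login looks suspicious is to count recent failures per user. The sketch below assumes a threshold of 3 failures within 5 minutes and keeps everything in memory; the class name, thresholds and storage are illustrative, and checks on location or interaction speed are left out.</p>

```python
import time

class LoginGuard:
    """Track recent failed logins per user and decide when to challenge."""

    def __init__(self, max_failures=3, window_seconds=300, clock=time.monotonic):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self.clock = clock
        self.failures = {}  # user name to list of failure timestamps

    def record_failure(self, user):
        self.failures.setdefault(user, []).append(self.clock())

    def record_success(self, user):
        self.failures.pop(user, None)

    def needs_challenge(self, user):
        """True when the user should get a CAPTCHA, a second-channel code or a delay."""
        cutoff = self.clock() - self.window_seconds
        recent = [t for t in self.failures.get(user, []) if t >= cutoff]
        self.failures[user] = recent
        return len(recent) >= self.max_failures

guard = LoginGuard()
for _ in range(3):
    guard.record_failure("alice")
print(guard.needs_challenge("alice"))
print(guard.needs_challenge("bob"))
```

<p>Normal users never see the extra step; only accounts with repeated recent failures are asked for a CAPTCHA or a code sent through another channel.</p>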
<p><strong>6. Adopt an existing, proven ID system</strong></p>
<p>If you don&#8217;t want to deal with all the tedious stuff above, you can consider adopting an existing ID system that is proven to work well. There are many such systems, such as<strong> OpenID, OAuth</strong> and the QQ Login service</p>
<p><strong>7. Other authentication related development tips:</strong></p>
<ul>
<li><a href="http://www.owasp.org/index.php/Guide_to_Authentication" rel="nofollow">OWASP Guide To Authentication</a></li>
<li><a href="http://www.cs.umass.edu/%7Ekevinfu/papers/webauth_tr.pdf" rel="nofollow">Dos and Don’ts of Client Authentication on the Web</a> (PDF)</li>
<li><a href="http://fishbowl.pastiche.org/2004/01/19/persistent_login_cookie_best_practice/" rel="nofollow">Charles Miller’s Persistent Login Cookie Best Practice</a></li>
<li><a href="http://en.wikipedia.org/wiki/HTTP_cookie#Drawbacks_of_cookies" rel="nofollow">Wikipedia: HTTP cookie</a></li>
<li><a href="http://cups.cs.cmu.edu/soups/2008/proceedings/p13Rabkin.pdf" rel="nofollow">Personal knowledge questions for fallback authentication: Security questions in the era of Facebook </a></li>
</ul>
<p>For other website security issues, be careful about: <a href="http://en.wikipedia.org/wiki/SQL_injection">SQL Injection</a>, <a href="http://en.wikipedia.org/wiki/Cross-site_scripting">Cross Site Scripting</a>, <a href="http://en.wikipedia.org/wiki/Session_hijacking">Session Hijacking</a></p>
<h3><strong>[Reference]</strong></h3>
<p>1. Hashing algorithm:<br />
- <a href="http://codahale.com/how-to-safely-store-a-password/">About BCrypt</a><br />
- <a href="http://en.wikipedia.org/wiki/Cryptographic_hash_function">MD5, SHA1, SHA256, SHA512, SHA-3</a><br />
- <a href="http://en.wikipedia.org/wiki/Rainbow_table">Rainbow Table</a></p>
<p>2. Bad password list:<br />
- <a href="http://www.whatsmypass.com/the-top-500-worst-passwords-of-all-time">Top 500 bad passwords</a><br />
- <a href="https://twitter.com/signup">Twitter password black list</a> (see source code)</p>
<p>3. Handbook about web security:<br />
- <a href="http://code.google.com/p/browsersec/wiki/Main" rel="nofollow">The Google Browser Security Handbook</a><br />
- <a href="http://rads.stackoverflow.com/amzn/click/0470170778" rel="nofollow">The Web Application Hacker’s Handbook</a></p>
<p>4. <a href="http://programmers.stackexchange.com/questions/46716/what-should-every-programmer-know-about-web-development">Web developers&#8217; must know</a></p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://csliu.com/2012/01/on-user-credentials-for-web-site/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>BookNotes – Morningstar’s Stock Tutorial</title>
		<link>http://csliu.com/2011/12/booknotes-morningstars-stock-tutorial/</link>
		<comments>http://csliu.com/2011/12/booknotes-morningstars-stock-tutorial/#comments</comments>
		<pubDate>Sun, 25 Dec 2011 14:50:31 +0000</pubDate>
		<dc:creator>csliu</dc:creator>
				<category><![CDATA[BookReview]]></category>
		<category><![CDATA[Business]]></category>
		<category><![CDATA[Management]]></category>

		<guid isPermaLink="false">http://blog.csliu.com/?p=470</guid>
		<description><![CDATA[<p>Part I &#8211; Foundations of Investment and Stock</p> <p><a href="http://v2work.bokee.com/viewdiary.182467108.html">101.Stocks Versus Other Investments</a></p> <p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.182482240.html">102.The Magic of Compounding </a></p> <p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.182562006.html">103.Investing for the Long Run</a></p> <p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.182607604.html">104.What Matters and What Doesn&#8217;t </a></p> <p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.182607608.html">105.The Purpose of a Company </a></p> <p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.182607644.html">106.Gathering Relevant Information </a></p> <p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.182607779.html">107.Introduction to [...]]]></description>
			<content:encoded><![CDATA[<div>
<p><strong>Part I &#8211; Foundations of Investment and Stock</strong></p>
<table width="887" border="0" cellspacing="0" cellpadding="0" align="left">
<tbody>
<tr>
<td style="text-align: left;" width="385">
<div>
<p><a href="http://v2work.bokee.com/viewdiary.182467108.html"><span style="font-size: medium;">101.Stocks Versus Other Investments</span></a></p>
</div>
</td>
</tr>
<tr>
<td>
<div>
<p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.182482240.html"><span style="font-size: medium;">102.The Magic of Compounding </span></a></p>
</div>
</td>
</tr>
<tr>
<td>
<div>
<p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.182562006.html"><span style="font-size: medium;">103.Investing for the Long Run</span></a></p>
</div>
</td>
</tr>
<tr>
<td>
<div>
<p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.182607604.html"><span style="font-size: medium;">104.What Matters and What Doesn&#8217;t </span></a></p>
</div>
</td>
</tr>
<tr>
<td>
<div>
<p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.182607608.html"><span style="font-size: medium;">105.The Purpose of a Company </span></a></p>
</div>
</td>
</tr>
<tr>
<td>
<div>
<p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.182607644.html"><span style="font-size: medium;">106.Gathering Relevant Information </span></a></p>
</div>
</td>
</tr>
<tr>
<td>
<div>
<p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.182607779.html"><span style="font-size: medium;">107.Introduction to Financial Statements </span></a></p>
</div>
</td>
</tr>
<tr>
<td>
<div>
<p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.182632416.html"><span style="font-size: medium;">108.Learn the Lingo&#8211;Basic Ratios </span></a></p>
</div>
</td>
</tr>
</tbody>
</table>
</div>
<div>
<p>Personal Investment Choices<br />
- Stock: 股票<br />
- Bond: 债券<br />
- Mutual Fund: 共同基金<br />
- Real Estate: 不动产<br />
- Bank Saving: 银行储蓄</p>
<p>Understanding Company, Stock/Shareholder and Bond/Creditor<br />
- The main purpose of a Company is to take money from investors (creditors and shareholders) and generate profits on their investments.<br />
- Creditors provide a company with debt capital (in the form of bonds), and shareholders provide a company with equity capital (in the form of shares). A stock is an ownership interest in a company, while bonds, at their most basic, are loans. When you buy a bond, you become a lender to an institution, and that institution pays you interest. As long as the institution does not go bankrupt, it will also pay back the principal on the bond, but no more than the principal.<br />
- Creditors are typically banks, bondholders, and suppliers. They lend money to companies in exchange for a fixed return on their debt capital, usually in the form of interest payments. Companies also agree to pay back the principal on their loans.<br />
- Shareholders that supply companies with equity capital are typically banks, mutual or hedge funds, and private investors. They give money to a company in exchange for an ownership interest in that business. Unlike creditors, shareholders do not get a fixed return on their investment because they are part owners of the company. When a company sells shares to the public (in other words, &#8220;goes public&#8221; to be &#8220;publicly traded&#8221;), it is actually selling an ownership stake in itself.</p>
<p>The Great Compound Interest (复利)<br />
- Compound interest means reinvesting the interest you earn, and it can grow your money surprisingly fast. A simple way to estimate the time it takes for money to double is the rule of 72. For example, if you want to know how many years it would take for an investment earning 12% to double, simply divide 72 by 12, and the answer is approximately six years. The reverse is also true. If you want to know what interest rate you would have to earn to double your money in five years, divide 72 by five, and the answer is about 14.4%, or roughly 15%.</p>
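<p>As a quick check of how close the rule of 72 is, the sketch below compares it with the exact doubling time under annual compounding. The function names are mine, and annual compounding is an assumption.</p>

```python
import math

def years_to_double_rule_of_72(rate_percent):
    """Approximate doubling time: 72 divided by the annual rate in percent."""
    return 72 / rate_percent

def years_to_double_exact(rate_percent):
    """Exact doubling time with annual compounding: ln(2) / ln(1 + r)."""
    return math.log(2) / math.log(1 + rate_percent / 100)

for rate in (6, 12, 15):
    approx = years_to_double_rule_of_72(rate)
    exact = years_to_double_exact(rate)
    print(rate, round(approx, 2), round(exact, 2))
```

<p>At 12% the rule gives 6.0 years while the exact figure is about 6.12 years, so the shortcut is good enough for mental arithmetic.</p>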
<p><strong>Part II &#8211; Stock Market and Qualitative Corporate Analysis</strong></p>
</div>
<div>
<div>
<p><a href="http://v2work.bokee.com/viewdiary.182632417.html"><span style="font-size: medium;">201.Stocks and Taxes </span></a></p>
</div>
<div>
<p><a href="http://v2work.bokee.com/viewdiary.182657395.html"><span style="font-size: medium;">202.Using Financial Services Wisely&#8211;Choose Broker </span></a></p>
</div>
<div>
<p><a href="http://v2work.bokee.com/viewdiary.182657708.html"><span style="font-size: medium;">203.Understanding the News </span></a></p>
</div>
<div>
<p><a href="http://v2work.bokee.com/viewdiary.182662031.html"><span style="font-size: medium;">204.Start Thinking Like an Analyst </span></a></p>
</div>
<div>
<p><a href="http://v2work.bokee.com/viewdiary.182782042.html"><span style="font-size: medium;">205.Economic Moats </span></a></p>
</div>
<div>
<p><a href="http://v2work.bokee.com/viewdiary.182782045.html"><span style="font-size: medium;">206.More on Competitive Positioning </span></a></p>
</div>
<div>
<p><a href="http://v2work.bokee.com/viewdiary.182822091.html"><span style="font-size: medium;">207.Weighting Management Quality </span></a></p>
</div>
</div>
<div>
<p> Shorting:做空</p>
<p>Stock Index<br />
-  A stock index is simply the price of a grouping or a composite of a number of different stocks, often with similar characteristics.<br />
- Three of the most widely followed indexes are the Dow Jones Industrial Average, the S&amp;P 500, and the Nasdaq Composite.<br />
- The Dow Jones Industrial Average: it is composed of 30 large stocks from a wide spectrum of industries that are selected by the editors of The Wall Street Journal. It&#8217;s basically the average price of the 30 stocks, although the calculation has been adjusted many times for events such as stock splits.<br />
- The S&amp;P 500: it is a market-capitalization-weighted stock index. The company list is maintained by Standard &amp; Poor&#8217;s, a division of McGraw-Hill.<br />
- The Nasdaq Composite: it is also a market-cap-weighted index, but it includes all companies listed on Nasdaq.</p>
<p>How to do Qualitative Analysis on Business? Ask and try to Answer questions:<br />
- What is the goal of the business?<br />
- How does the business make money?<br />
- How well is the business actually doing?<br />
- How well is the business positioned relative to its competitors?</p>
<p>Analyze Competitive Positioning of Business<br />
- Find a business&#8217;s economic moat, which is a long-term competitive advantage that allows a company to earn oversized profits over time.<br />
- Economic Moat Types:<br />
Low Cost (due to scale or core technology)<br />
High Switching Cost (customer stickiness)<br />
Network Effect (ecosystem, scale/size matters, <span>Matthew effect, winner takes all)</span><br />
Intangible Assets (government approvals, brand names etc.)</p>
<p>How to Build Economic Moat?<br />
- Creating real or perceived product differentiation<br />
- Driving costs down and being a low-cost leader<br />
- Locking in customers by creating high switching costs<br />
- Locking out competitors by creating high barriers to entry or high barriers to success</p>
<p>Understand Strategic Positioning using Porter&#8217;s Five Forces<br />
- Barriers to Entry. How easy is it for new firms to start competing in a market? Higher barriers are better.<br />
- Buyer (Customer) Power. Similar to switching costs, what keeps customers locked in or causes them to jump ship if prices were to increase? Lower power is better.<br />
- Supplier Power. How well can a company control the costs of its goods and services? Lower power is better.<br />
- Threat of Substitutes. A company may be the best widget maker, but what if widgets will soon become obsolete? Also, are there cheaper or better alternatives?<br />
- Degree of Rivalry. Including the four factors above, just how competitive is a company&#8217;s industry? Are companies beating one another bloody over every last dollar? How often are moats trying to be breached and profits being stolen away?</p>
<p>Porter&#8217;s five forces considered together can help you to determine whether a firm has an economic moat. The framework is particularly useful for examining a firm&#8217;s external competitive environment.</p>
<p><strong>Part III &#8211; Accounting and Quantitative Corporate Analysis</strong><span style="font-size: medium;"><br />
</span></p>
</div>
<div>
<p><a href="http://v2work.bokee.com/viewdiary.182892029.html"><span style="font-size: medium;">301.The Income Statement </span></a></p>
</div>
<div>
<p><a href="http://v2work.bokee.com/viewdiary.182897078.html"><span style="font-size: medium;">302.The Balance Sheet </span></a></p>
</div>
<div>
<p><a href="http://v2work.bokee.com/viewdiary.183097243.html"><span style="font-size: medium;">303.The Statement of Cash Flows </span></a></p>
</div>
<div>
<p><a href="http://v2work.bokee.com/viewdiary.183097359.html"><span style="font-size: medium;">304.Interpreting the Numbers </span></a></p>
</div>
<div>
<p><a href="http://v2work.bokee.com/viewdiary.183112029.html"><span style="font-size: medium;">305.Quantifying Competitive Advantages </span></a></p>
</div>
<div>
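<p>About financial statements: apart from two points worth emphasizing, everything else worth noting is the Chinese translation of English terms</p>
<p>1. The accounting identity: <strong>Assets &#8211; Liabilities = Equity  </strong></p>
<p>2. The difference between the Income Statement and the Statement of Cash Flows<br />
- It stems from the accounting principle of <strong>Accrual Accounting (权责发生制)</strong>, which requires companies to record revenue and expenses when the corresponding transactions occur, not when cash is exchanged</p>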
<p>Operating Activities &#8211; 运营活动</p>
</div>
<div>
<p>Investing  Activities &#8211; 投资活动</p>
</div>
<div>
<p>Financing Activities &#8211; 融资活动</p>
</div>
<div>
<p>Capital Expenditure (a.k.a. CapEx) &#8211; 资本支出</p>
</div>
<div>
<p>Monetary Investment &#8211; 货币投资</p>
</div>
<div>
<p>Dilute &#8211; 稀释 (the denominator of the ratio gets larger)</p>
<p>Depreciation &#8211; 折旧摊销 (tangible assets)</p>
<p>Amortization &#8211; 费用摊销 (intangible assets)</p>
<p>Retained Earnings &#8211; 未分配利润</p>
<p>Treasury Stock &#8211; 留存股票</p>
<p>Financial statement analysis metrics:</p>
<p>- Efficiency<br />
Inventory Turnover<br />
Accounts Receivable Turnover<br />
Accounts Payable Turnover<br />
Asset Turnover</p>
<p>- Liquidity<br />
Current Ratio<br />
Cash Ratio</p>
<p>- Leverage<br />
Debt/Equity<br />
Interest Coverage</p>
<p>- Profitability<br />
Gross Margin<br />
Operating Margin<br />
Net Margin<br />
Return on Assets<br />
Return on Equity</p>
</div>
<div>
<p>Quantitative analysis of a company&#8217;s competitive advantage:</p>
</div>
<div>
<p>- ROA (Return on Assets) = Net Income / Average Assets<br />
= (Net Income / Revenue) * (Revenue / Average Assets)<br />
= Net Margin * Assets Turnover</p>
</div>
<div>
<p>- ROE (Return on Equity) = Net Income / Average Equity<br />
= (Net Income / Revenue) * (Revenue / Average Assets) * (Average Assets / Average Equity)<br />
= Net Margin * Assets Turnover *  Assets-Equity Ratio</p>
</div>
<div>
<p>- ROIC (Return on Invested Capital) = Operating Profit After Tax / Invested Capital<br />
= (Operating Profit * (1 &#8211; Tax Rate)) / (Assets &#8211; Excess Cash &#8211; Non-Interest-Bearing Current Liabilities)</p>
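<p>The three return measures above translate directly into code. The sketch below follows the decompositions in these notes; the function names and the sample figures are hypothetical.</p>

```python
def roa(net_income, revenue, avg_assets):
    """Return on assets = net margin * asset turnover."""
    net_margin = net_income / revenue
    asset_turnover = revenue / avg_assets
    return net_margin * asset_turnover

def roe(net_income, revenue, avg_assets, avg_equity):
    """Return on equity = ROA * (average assets / average equity)."""
    return roa(net_income, revenue, avg_assets) * (avg_assets / avg_equity)

def roic(operating_profit, tax_rate, assets, excess_cash, non_interest_liabilities):
    """Return on invested capital = after-tax operating profit / invested capital."""
    invested_capital = assets - excess_cash - non_interest_liabilities
    return operating_profit * (1 - tax_rate) / invested_capital

# Hypothetical company: figures are in the same currency unit.
print(roa(net_income=10, revenue=100, avg_assets=200))
print(roe(net_income=10, revenue=100, avg_assets=200, avg_equity=80))
print(roic(operating_profit=20, tax_rate=0.25, assets=200,
           excess_cash=20, non_interest_liabilities=30))
```

<p>Splitting ROE into margin, turnover and leverage shows where a high return comes from: a company with the same ROA can report a much higher ROE simply by carrying more debt.</p>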
<p><strong>Part IV Stock Investment Analysis and Strategies</strong></p>
</div>
<div>
<table border="0" cellspacing="0" cellpadding="0" align="left">
<tbody>
<tr>
<td>
<div>
<p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.183132257.html"><span style="font-size: medium;">401.Understanding Value </span></a></p>
</div>
</td>
</tr>
<tr>
<td>
<div>
<p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.183132301.html"><span style="font-size: medium;">402.Using Ratios and Multiples1 </span></a></p>
</div>
</td>
</tr>
<tr>
<td>
<div>
<p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.183339779.html"><span style="font-size: medium;">403.Using Ratios and Multiples2 </span></a></p>
</div>
</td>
</tr>
<tr>
<td>
<div>
<p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.183339781.html"><span style="font-size: medium;">404.Introduction to Discounted Cash Flow </span></a></p>
</div>
</td>
</tr>
<tr>
<td>
<div>
<p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.183339784.html"><span style="font-size: medium;">405.Putting DCF into Action </span></a></p>
</div>
</td>
</tr>
<tr>
<td>
<div>
<p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.183340922.html"><span style="font-size: medium;">406.The Fat-Pitch Strategy </span></a></p>
</div>
</td>
</tr>
<tr>
<td>
<div>
<p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.183340925.html"><span style="font-size: medium;">407.Using Morningstar&#8217;s Rating for Stocks </span></a></p>
</div>
</td>
</tr>
<tr>
<td>
<div>
<p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.183341234.html"><span style="font-size: medium;">408.Psychology and Investing </span></a></p>
</div>
</td>
</tr>
<tr>
<td>
<div>
<p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.183342186.html"><span style="font-size: medium;">409.The Case for Dividends </span></a></p>
</div>
</td>
</tr>
<tr>
<td>
<div>
<p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.183342254.html"><span style="font-size: medium;">410.The Dividend Drill </span></a></p>
</div>
</td>
</tr>
</tbody>
</table>
</div>
<div>
<p>A great company is not the same as a great investment. Your goal as an investor should be to find wonderful businesses and invest in them at reasonable prices.</p>
<p>Company Valuation &#8211; determine the value of a company.</p>
<p>Measuring Business Value:<br />
Market Capitalization = Outstanding Share Count * Share Price<br />
Enterprise Value = Market Cap + Debt &#8211; Cash</p>
<p>There are actually two parts to the value of any business:<br />
- The first part is the current value of all the business&#8217;s assets and liabilities, including buildings, employees, inventories, and so forth.<br />
- The second part is the value of the profits the business is expected to make in the future.</p>
<p>There are two broad approaches to stock valuation. One is the ratio-based approach and the other is the intrinsic value approach:<br />
- Valuation ratios compare the company&#8217;s market value with some financial aspect of its performance&#8211;earnings, sales, book value, cash flow, and so on.<br />
- The ratio-based approach is the most commonly used method for valuing stocks, because ratios are easy to calculate and readily available. The downside is that making sense of valuation ratios requires quite a bit of context.<br />
- Popular ratio-based measures:<br />
Price / Sales<br />
Price / Earnings<br />
Price / Book<br />
Cash Return = (Free Cash Flow + Net Interest Expense) / (Enterprise Value)<br />
- The other major approach to valuation tries to estimate what a stock should intrinsically be worth.<br />
- A stock&#8217;s intrinsic value is based on projecting the company&#8217;s future cash flows along with other factors. You can compare this intrinsic or fair value with a stock&#8217;s market price to determine whether the stock looks underpriced, fairly valued, or overpriced.<br />
- However, the main disadvantage is that estimating future cash flows and coming up with a fair value estimate requires a lot of time and effort.</p>
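<p>The ratio-based measures listed above can be computed from a handful of inputs. This sketch uses company-level totals (market capitalization instead of per-share price), which gives the same ratios; the function name and sample numbers are hypothetical.</p>

```python
def valuation_ratios(price, shares, sales, earnings, book_value,
                     free_cash_flow, net_interest_expense, debt, cash):
    """Compute the ratio-based measures from company-level totals."""
    market_cap = price * shares
    enterprise_value = market_cap + debt - cash
    return {
        "market_cap": market_cap,
        "enterprise_value": enterprise_value,
        "price_to_sales": market_cap / sales,
        "price_to_earnings": market_cap / earnings,
        "price_to_book": market_cap / book_value,
        "cash_return": (free_cash_flow + net_interest_expense) / enterprise_value,
    }

# Hypothetical company, all totals in the same currency unit.
ratios = valuation_ratios(price=20, shares=10, sales=100, earnings=10,
                          book_value=50, free_cash_flow=12,
                          net_interest_expense=3, debt=60, cash=10)
for name, value in ratios.items():
    print(name, value)
```

<p>The numbers only become meaningful next to context, for example the same ratios for competitors or for the company&#8217;s own history.</p>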
<p>Estimating a stock&#8217;s fair value, or intrinsic value using DCF model:<br />
- The main idea behind a DCF model is relatively simple: A stock&#8217;s worth is equal to the present value of all its estimated future cash flows.<br />
- Free cash flow represents the cash a company has left over after spending the money necessary to keep the company growing at its current rate.<br />
- Many variables go into estimating those cash flows, but among the most important are the company&#8217;s future sales growth and profit margins.<br />
- Which cash flow should you predict and discount to present value? Free cash flow rather than dividend payments, because many firms pay no dividends.<br />
- How to do the discounting?<br />
Present Value of Cash Flow in Year N =  CF at Year N / (1+ R)^N<br />
CF = Cash Flow<br />
R = Required Return (Discount Rate)<br />
N = Number of Years in the Future<br />
- The rate you would use to discount cash flows if using the &#8220;cash flow to the firm&#8221; method is actually a company&#8217;s weighted average cost of capital, or WACC. A company&#8217;s WACC accounts for both the firm&#8217;s cost of equity and its cost of debt, weighted according to the proportions of equity and debt in the company&#8217;s capital structure. Here&#8217;s the basic formula for WACC: (Weight of Debt) * (Cost of Debt)  +  (Weight of Equity)*(Cost of Equity)<br />
- Computing the intrinsic value of a company: sum of all discounted (to present) future free cash flows</p>
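<p>The discounting formula and the WACC formula above look like this in code. The function names and example figures are assumptions for illustration, and the WACC here ignores any tax adjustment to the cost of debt, just like the basic formula in these notes.</p>

```python
def present_value(cash_flow, rate, year):
    """Discount a cash flow received `year` years from now: CF / (1 + R)^N."""
    return cash_flow / (1 + rate) ** year

def wacc(debt, equity, cost_of_debt, cost_of_equity):
    """Weighted average cost of capital from the capital structure."""
    total = debt + equity
    return (debt / total) * cost_of_debt + (equity / total) * cost_of_equity

# Hypothetical firm: 40% debt at 5%, 60% equity at 10%.
rate = wacc(debt=40, equity=60, cost_of_debt=0.05, cost_of_equity=0.10)
print(rate)
print(present_value(cash_flow=100, rate=rate, year=3))
```

<p>The further away a cash flow is, the less it is worth today, and a higher discount rate shrinks it faster.</p>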
<p>The math tricks behind DCF<br />
- When computing the sum, we assume the company will keep generating cash flow forever, but the growth rate differs between the near future and the far future<br />
- We usually assign specific growth rates to each of the next 5-10 years, but assume one relatively small constant rate for long-term growth<br />
- Perpetuity Value: an estimate, in one lump sum, of the value of all cash flows after some specific year. It&#8217;s in fact the sum of a geometric series with common ratio: (1+g)/(1+R)</p>
<p>Perpetuity Value = ( CFn * (1 +  g) ) / (R &#8211; g)<br />
CFn = Cash Flow in the Last Individual Year Estimated<br />
g = Long-Term Growth Rate<br />
R = Discount Rate, or Cost of Capital<br />
- The perpetuity value should also be discounted back to the present when computing the intrinsic value of a company</p>
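<p>Putting the pieces together, a minimal DCF sketch: discount each explicitly forecast cash flow, then add the perpetuity value discounted from the last forecast year. The function names and the sample forecast are hypothetical, and the formula only makes sense when the discount rate is larger than the long-term growth rate.</p>

```python
def perpetuity_value(last_cash_flow, growth, rate):
    """Value at year n of all cash flows after year n: CFn * (1 + g) / (R - g)."""
    if not rate > growth:
        raise ValueError("discount rate must exceed long-term growth rate")
    return last_cash_flow * (1 + growth) / (rate - growth)

def intrinsic_value(cash_flows, growth, rate):
    """Sum of discounted forecast cash flows plus the discounted perpetuity value."""
    years = len(cash_flows)
    explicit = sum(
        cf / (1 + rate) ** year
        for year, cf in enumerate(cash_flows, start=1)
    )
    terminal = perpetuity_value(cash_flows[-1], growth, rate) / (1 + rate) ** years
    return explicit + terminal

# Hypothetical five-year free cash flow forecast, 3% long-term growth, 10% discount rate.
forecast = [100, 110, 120, 130, 140]
print(round(intrinsic_value(forecast, growth=0.03, rate=0.10), 2))
```

<p>Small changes to the growth or discount rate move the perpetuity term a lot, which is why the choice of these parameters dominates the result.</p>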
<p>The main problem with using the DCF model to compute intrinsic value is that you have to determine many variables, such as the discount rate and the growth rates for the near future and the long term. Estimating these parameters is the real challenge of the model. To make sure your assumptions about them make sense, you have to become familiar with the industries involved.</p>
<p>Stock Investing Strategy &#8211; Fat Pitch Strategy: Don&#8217;t Rush, Be Patient Until You Are Confident Enough</p>
<p>- Look for Wide-Moat Companies<br />
- Always Have a Margin of Safety<br />
- Don&#8217;t Be Afraid to Hold Cash<br />
- Don&#8217;t Be Afraid to Hold Relatively Few Stocks<br />
- Don&#8217;t Trade Very Often</p>
<p>Investing Psychology: Mental Stuff that Leads to Investing Mistakes<br />
- overconfidence<br />
- selective memory<br />
- self-handicapping<br />
- loss aversion<br />
- sunk cost<br />
- anchoring: when estimating the unknown, we cleave to what we know<br />
- confirmation bias<br />
- mental accounting<br />
- framing effect<br />
- herding (the herd effect)</p>
<p><strong>Part V Misc Tips and Great Investors</strong></p>
</div>
<div>
<table width="887" border="0" cellspacing="0" cellpadding="0" align="left">
<tbody>
<tr>
<td>
<div>
<p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.183407014.html"><span style="font-size: medium;">51.Constructing a Portfolio </span></a></p>
</div>
</td>
</tr>
<tr>
<td>
<div>
<p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.183432003.html"><span style="font-size: medium;">52.Introduction to Options </span></a></p>
</div>
</td>
</tr>
<tr>
<td>
<div>
<p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.183682102.html"><span style="font-size: medium;">53.Unconventional Equities </span></a></p>
</div>
</td>
</tr>
<tr>
<td>
<div>
<p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.183682105.html"><span style="font-size: medium;">54.Great Investors: Benjamin Graham </span></a></p>
</div>
</td>
</tr>
<tr>
<td>
<div>
<p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.183682202.html"><span style="font-size: medium;">55.Great Investors: Philip Fisher </span></a></p>
</div>
</td>
</tr>
<tr>
<td>
<div>
<p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.183842343.html"><span style="font-size: medium;">56.Great Investors: Warren Buffett </span></a></p>
</div>
</td>
</tr>
<tr>
<td>
<div>
<p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.183842346.html"><span style="font-size: medium;">57.Great Investors: Peter Lynch </span></a></p>
</div>
</td>
</tr>
<tr>
<td>
<div>
<p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.183857013.html"><span style="font-size: medium;">58.Great Investors: Others in the Hall of Fame </span></a></p>
</div>
</td>
</tr>
<tr>
<td>
<div>
<p style="text-align: left;"><a href="http://v2work.bokee.com/viewdiary.183857015.html"><span style="font-size: x-small;"><span style="font-size: medium;">59.20 Stock-Investing Tips</span></span></a></p>
</div>
</td>
</tr>
</tbody>
</table>
</div>
<p>Portfolio Management<br />
- diversification: don&#8217;t put all your eggs in one basket<br />
- if you own about 12 to 18 stocks, you have obtained more than 90% of the benefits of diversification, assuming you own an equally weighted portfolio<br />
- don&#8217;t weight each stock equally in your portfolio if you want to outperform the market index<br />
- consider including mutual funds to cover areas that you are not familiar with</p>
<p>Options<br />
- an option is the right, but not the obligation, to sell (put option) or buy (call option) something (a stock, in the case of stock options) at a specific price stated in the option contract; you choose whether or not to exercise it when it expires<br />
- options make shorting possible</p>
<p>Investing Tips<br />
- Keep It Simple<br />
- Have the Proper Expectations<br />
- Be Prepared to Hold for a Long Time<br />
- Tune Out the Noise<br />
- Behave Like an Owner<br />
- Buy Low, Sell High<br />
- Watch Where You Anchor<br />
- Remember that Economics Usually Trumps Management Competence<br />
- Be Careful of Snakes<br />
- Bear in Mind that Past Trends Often Continue<br />
- Prepare for the Situation to Proceed Faster than You Think<br />
- Expect Surprises to Repeat<br />
- Don&#8217;t Be Stubborn (Stubborn VS Patient)<br />
- Listen to Your Gut<br />
- Know Your Friends, and Your Enemies<br />
- Recognize the Signs of a Top<br />
- Look for Quality<br />
- Don&#8217;t Buy Without Value<br />
- Always Have a Margin of Safety<br />
- Think Independently</p>
<p><strong>One Sentence Summary</strong></p>
<p>- Invest for the long term based on quantitative and qualitative analysis; don&#8217;t speculate unless you are willing to rely on luck.</p>
]]></content:encoded>
			<wfw:commentRss>http://csliu.com/2011/12/booknotes-morningstars-stock-tutorial/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Lecture Notes – Massive-scale online collaboration</title>
		<link>http://csliu.com/2011/12/lecture-notes-massive-scale-online-collaboration/</link>
		<comments>http://csliu.com/2011/12/lecture-notes-massive-scale-online-collaboration/#comments</comments>
		<pubDate>Sat, 10 Dec 2011 05:04:37 +0000</pubDate>
		<dc:creator>csliu</dc:creator>
				<category><![CDATA[Business]]></category>
		<category><![CDATA[Management]]></category>

		<guid isPermaLink="false">http://blog.csliu.com/?p=467</guid>
		<description><![CDATA[<p>There is a popular TED talk titled <a href="http://www.ted.com/talks/luis_von_ahn_massive_scale_online_collaboration.html">Massive-scale online collaboration</a>, given by <a href="http://www.cs.cmu.edu/~biglou/">Luis von Ahn</a>.</p> <p></p> <p>Luis is a well-known computer scientist who focuses on so-called human computation technologies. He is famous for his earlier projects <a href="http://en.wikipedia.org/wiki/CAPTCHA">CAPTCHA</a> and <a href="http://en.wikipedia.org/wiki/ReCAPTCHA">reCAPTCHA</a>. In fact, the word CAPTCHA was coined [...]]]></description>
			<content:encoded><![CDATA[<p>There is a popular TED talk titled <a href="http://www.ted.com/talks/luis_von_ahn_massive_scale_online_collaboration.html">Massive-scale online collaboration</a>, given by <a href="http://www.cs.cmu.edu/~biglou/">Luis von Ahn</a>.</p>
<p><object width="526" height="374" classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0"><param name="allowFullScreen" value="true" /><param name="allowScriptAccess" value="always" /><param name="wmode" value="transparent" /><param name="bgColor" value="#ffffff" /><param name="flashvars" value="vu=http://video.ted.com/talk/stream/2011X/Blank/LuisVonAhn_2011X-320k.mp4&amp;su=http://images.ted.com/images/ted/tedindex/embed-posters/LuisVonAhn_2011X-embed.jpg&amp;vw=512&amp;vh=288&amp;ap=0&amp;ti=1295&amp;lang=&amp;introDuration=15330&amp;adDuration=4000&amp;postAdDuration=830&amp;adKeys=talk=luis_von_ahn_massive_scale_online_collaboration;year=2011;theme=the_rise_of_collaboration;event=TEDxCMU;tag=Technology;tag=collaboration;tag=computers;tag=internet;tag=language;&amp;preAdTag=tconf.ted/embed;tile=1;sz=512x288;" /><param name="src" value="http://video.ted.com/assets/player/swf/EmbedPlayer.swf" /><param name="pluginspace" value="http://www.macromedia.com/go/getflashplayer" /><param name="allowfullscreen" value="true" /><param name="allowscriptaccess" value="always" /><embed width="526" height="374" type="application/x-shockwave-flash" src="http://video.ted.com/assets/player/swf/EmbedPlayer.swf" allowFullScreen="true" allowScriptAccess="always" wmode="transparent" bgColor="#ffffff" flashvars="vu=http://video.ted.com/talk/stream/2011X/Blank/LuisVonAhn_2011X-320k.mp4&amp;su=http://images.ted.com/images/ted/tedindex/embed-posters/LuisVonAhn_2011X-embed.jpg&amp;vw=512&amp;vh=288&amp;ap=0&amp;ti=1295&amp;lang=&amp;introDuration=15330&amp;adDuration=4000&amp;postAdDuration=830&amp;adKeys=talk=luis_von_ahn_massive_scale_online_collaboration;year=2011;theme=the_rise_of_collaboration;event=TEDxCMU;tag=Technology;tag=collaboration;tag=computers;tag=internet;tag=language;&amp;preAdTag=tconf.ted/embed;tile=1;sz=512x288;" pluginspace="http://www.macromedia.com/go/getflashplayer" 
allowfullscreen="true" allowscriptaccess="always" /></object></p>
<p>Luis is a well-known computer scientist who focuses on so-called human computation technologies. He is famous for his earlier projects <a href="http://en.wikipedia.org/wiki/CAPTCHA">CAPTCHA</a> and <a href="http://en.wikipedia.org/wiki/ReCAPTCHA">reCAPTCHA</a>. In fact, the word CAPTCHA was coined by him and his co-authors as an acronym for &#8220;<strong>C</strong>ompletely <strong>A</strong>utomated <strong>P</strong>ublic <strong>T</strong>uring test to tell <strong>C</strong>omputers and <strong>H</strong>umans <strong>A</strong>part&#8221; in the paper <a href="http://dx.doi.org/10.1007%2F3-540-39200-9_18">CAPTCHA: Using Hard AI Problems for Security</a>.</p>
<p>CAPTCHA is widely known, since we have all encountered it many times in our daily web life. reCAPTCHA is less well known, although we have probably faced it many times too, and it is solving a hard AI problem every day.</p>
<p>The motivation behind reCAPTCHA is that about 200M CAPTCHAs are typed per day, and each one takes a person around 10 seconds. That is a huge waste of time and intelligence, so Luis wanted to put this resource to useful work: solving AI problems that can be divided into small, 10-second chunks.</p>
<p>Fortunately, there is one such problem: book digitization, i.e. scanning physical books and turning the scanned images into text. Many OCR (optical character recognition) technologies already do this automatically, but they are not good enough; it&#8217;s said that for books more than 50 years old, OCR cannot handle more than 30% of the words. So the OCR task can be divided into small pieces (usually one word per piece) that people solve while they are doing CAPTCHAs on the Internet. This is called reCAPTCHA.</p>
<p>How does it work? Each time a CAPTCHA is requested, the system sends two pictures to the user. One picture contains a word that the system already knows; the other contains a word it does not know, taken from a book that needs to be OCRed. When the user&#8217;s answer comes back, the system checks whether the text typed for the first picture matches the known word; if it does, the system has some confidence that the text typed for the second picture is also correct. To handle cases where the second answer is wrong, the system sends the same picture to multiple people and takes the most popular answer as the final result.</p>
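<p>The two-word check and the majority vote can be sketched as follows. This is a toy model, not the real reCAPTCHA service: the class name, the method names and the threshold of three votes are all assumptions made for illustration.</p>

```python
from collections import Counter, defaultdict


class RecaptchaSketch:
    """Toy model of the two-word check; not the real reCAPTCHA service."""

    def __init__(self, min_votes: int = 3):
        self.min_votes = min_votes          # assumed threshold, not from the talk
        self.votes = defaultdict(Counter)   # unknown image id -> answer counts

    def submit(self, known_answer: str, typed_known: str,
               unknown_id: str, typed_unknown: str) -> bool:
        """Return True if the user passes; record the guess for the unknown word."""
        if typed_known.strip().lower() != known_answer.lower():
            return False                    # control word failed: discard both answers
        self.votes[unknown_id][typed_unknown.strip().lower()] += 1
        return True

    def resolved(self, unknown_id: str):
        """Most popular answer once enough votes exist, else None."""
        counts = self.votes[unknown_id]
        if sum(counts.values()) < self.min_votes:
            return None
        return counts.most_common(1)[0][0]
```

<p>Only answers from users who got the known word right are counted, which is what gives the system some confidence in the votes for the unknown word.</p>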
<p>This works very well, and reCAPTCHA was acquired by Google in 2009.</p>
<p>But Luis didn&#8217;t stop there; he is now introducing another great idea called Duolingo.</p>
<p>The problem Duolingo wants to solve is translating the web into different languages, and the challenges are:<br />
- Lack of bilinguals<br />
- Lack of motivation</p>
<p>The way to solve this problem is learning by doing for language learners.</p>
<p>When doing translation exercises, learners are given real-world sentences taken from the web pages that need translating. This approach can solve the web translation problem because there are so many language learners in the world, and it also creates a positive feedback loop that addresses the motivation challenge:<br />
- Learning with real content: learners get good exercises to improve their skills<br />
- A fair business model for language education: learners can learn for free, since they contribute something valuable while learning</p>
<p>Luis calls this project <a href="http://duolingo.com/">Duolingo</a>, and I think it is very promising and super attractive.</p>
]]></content:encoded>
			<wfw:commentRss>http://csliu.com/2011/12/lecture-notes-massive-scale-online-collaboration/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>download documents managed by sharepoint using wget</title>
		<link>http://csliu.com/2011/11/download-documents-managed-by-sharepoint-using-wget/</link>
		<comments>http://csliu.com/2011/11/download-documents-managed-by-sharepoint-using-wget/#comments</comments>
		<pubDate>Sat, 05 Nov 2011 11:04:51 +0000</pubDate>
		<dc:creator>csliu</dc:creator>
				<category><![CDATA[CsNotes]]></category>

		<guid isPermaLink="false">http://blog.csliu.com/?p=461</guid>
		<description><![CDATA[<p>Wget is a great tool for automating document downloads when you work in a team whose shared resources are managed by SharePoint.</p> <p>But there are two common problems:</p> <p>1. Authentication</p> <p>SharePoint usually uses NTLM authentication, which wget doesn&#8217;t support as of version 1.11.4 for Windows.</p> <p>The solution is to use a [...]]]></description>
			<content:encoded><![CDATA[<p>Wget is a great tool for automating document downloads when you work in a team whose shared resources are managed by SharePoint.</p>
<p>But there are two common problems:</p>
<p>1. Authentication</p>
<p>SharePoint usually uses NTLM authentication, which wget doesn&#8217;t support as of version 1.11.4 for Windows.</p>
<p>The solution is to use a proxy server that supports NTLM:<br />
- NTLMaps is a good NTLM-enabled proxy server: <a href="http://ntlmaps.sourceforge.net/">http://ntlmaps.sourceforge.net/</a><br />
- Edit server.cfg in the NTLMaps install folder; each section is self-explanatory<br />
- NTLMaps needs Python to run, so install Python first</p>
<p>To make wget connect through the proxy server rather than directly, set the environment variable <strong>http_proxy</strong> to http://localhost:5865 (the default NTLMaps port).</p>
<p>2. Restricting which documents are downloaded</p>
<p>SharePoint usually manages a large number of documents for different projects, so you need to restrict the type and folder path of the documents you are going to download.</p>
<p>For the document type, the --accept option works; it takes a comma-separated list, for example: --accept=docx,pptx,doc,ppt,vsd,txt</p>
<p>For the folder path, a simple way to stay within the starting folder and its subfolders is the -np (no parent) option, which stops wget from ascending to the parent directory.</p>
<p>One of my full wget command lines looks like this:</p>
<p>wget -c -N -nH -r -np --accept=docx,pptx,doc,ppt,vsd,txt,aspx http://sharepoint/sites/xxx/yyy/xxx</p>
<p>NOTE: even if you only want to download *.ppt files, you also need to list aspx as an accepted file type, because wget needs those pages to traverse the site and find the target files.</p>
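<p>If you run this regularly, the command and the proxy setting can be wrapped in a small script. The sketch below uses Python (which NTLMaps already requires); the helper name build_wget is hypothetical, and it assumes that wget is on the PATH and that NTLMaps is listening on its default port.</p>

```python
import os
import subprocess


def build_wget(url: str, extensions: list[str],
               proxy: str = "http://localhost:5865"):
    """Return (argv, env) for a recursive, type-restricted wget download."""
    wanted = list(extensions)
    if "aspx" not in wanted:
        wanted.append("aspx")               # listing pages are needed for traversal
    argv = ["wget", "-c", "-N", "-nH", "-r", "-np",
            "--accept=" + ",".join(wanted), url]
    env = dict(os.environ, http_proxy=proxy)  # route requests through NTLMaps
    return argv, env


if __name__ == "__main__":
    argv, env = build_wget("http://sharepoint/sites/xxx/yyy/xxx", ["ppt", "pptx"])
    print(" ".join(argv))
    # Uncomment to actually run it (requires wget and a running NTLMaps proxy):
    # subprocess.run(argv, env=env, check=True)
```

<p>Passing the proxy through the env argument keeps the setting local to this one command instead of changing it for the whole session.</p>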
<p>[Reference]<br />
1. http://gnuwin32.sourceforge.net/packages/wget.htm<br />
2. http://ntlmaps.sourceforge.net/</p>
]]></content:encoded>
			<wfw:commentRss>http://csliu.com/2011/11/download-documents-managed-by-sharepoint-using-wget/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>VLDB 2011 Trip Report</title>
		<link>http://csliu.com/2011/09/vldb2011-trip-report/</link>
		<comments>http://csliu.com/2011/09/vldb2011-trip-report/#comments</comments>
		<pubDate>Sun, 11 Sep 2011 03:35:26 +0000</pubDate>
		<dc:creator>csliu</dc:creator>
				<category><![CDATA[Database]]></category>
		<category><![CDATA[DistributedSystem]]></category>
		<category><![CDATA[InfoRetrieval]]></category>

		<guid isPermaLink="false">http://blog.csliu.com/?p=450</guid>
		<description><![CDATA[<p>I attended <a href="http://www.vldb.org/2011/">VLDB 2011</a> in Seattle from 29/08 to 01/09. Here are some brief observations on and reports from the conference. Because VLDB covers such a broad area, I focus only on System, Search and Transaction related material.</p> <p>(Disclaimer: because the paper list is long and I lack a strong DB background, this report may contain misunderstandings and [...]]]></description>
			<content:encoded><![CDATA[<p>I attended <a href="http://www.vldb.org/2011/">VLDB 2011</a> in Seattle from 29/08 to 01/09. Here are some brief observations on and reports from the conference. Because VLDB covers such a broad area, I focus only on <strong>System</strong>, <strong>Search</strong> and <strong>Transaction</strong> related material.</p>
<p>(Disclaimer: because the paper list is long and I lack a strong DB background, this report may contain misunderstandings and personal biases; feel free to follow up and comment.)</p>
<h4><strong>=Basic Info=</strong></h4>
<p>VLDB is one of the top conferences in the DB community (the others are SIGMOD and ICDE); it focuses on data management and the systems that support it. For VLDB 2011, there were:</p>
<p>-        30 research sessions with 104 papers and 4 industrial sessions with 12 papers, presented in 5 parallel tracks<br />
-        8 of the 104 research papers were contributed by people from Microsoft<br />
-        31 of the 104 research papers had a Chinese first author (domestic + overseas):</p>
<p style="padding-left: 30px;">o   Mainland: 4<br />
o   Hong Kong: 4<br />
o   Singapore: 6<br />
o   Overseas: 17</p>
<p>Industrial Participation:<br />
-        Microsoft hosted the reception dinner on 30/08<br />
-        Google/Facebook/EMC had their recruiting/advertising booths at the conference site</p>
<p>Best Paper:<br />
-        <a href="http://www.google.com/search?q=RemusDB%3A+Transparent+High+Availability+for+Database+Systems&amp;ie=utf-8&amp;oe=utf-8&amp;aq=t&amp;rls=org.mozilla:en-US:official&amp;client=firefox-a">RemusDB: Transparent High Availability for Database Systems</a></p>
<h4><strong>=Topics and Trends=</strong></h4>
<p>Hot Topics:<br />
-        Graph and Social Data Management: 5 sessions (17 papers), 2 tutorials<br />
-        Big Data Analytics and Infrastructure: 3+ sessions (12+ papers), 1 tutorial<br />
-        Streaming and Real-time Data Analysis</p>
<p>Traditional system topics for DBMS:</p>
<p>-        <strong>Query Processing</strong> session covers:</p>
<p style="padding-left: 30px;">o   Partitioning the storage and query processing of a native XML database<br />
o   Using a new GroupJoin operator to speed up queries that combine GroupBy and Join<br />
o   Optimizing similarity joins using locality-sensitive hashing</p>
<p>-        <strong>Transaction Processing</strong> session covers:</p>
<p style="padding-left: 30px;">o   Scaling OLTP systems on a shared-everything architecture using logical + physical partitioning<br />
o   Implementing and optimizing recovery algorithms in a DBMS where data management and transactional functionality are separated<br />
o   New transaction semantics and isolation level definitions for cooperation between traditional transactions<br />
o   Hyder’s optimistic concurrency control algorithm for fast network and storage settings</p>
<p>I was pleasantly surprised to find several (distributed) systems related sessions, with several papers closely related to on-going projects in our group:<br />
-        <strong>New Hardware Architecture</strong>, it covers</p>
<p style="padding-left: 30px;">o   A main-memory, column/row hybrid storage engine driven by application traces<br />
o   Compiling query statements directly into native binary code rather than an iterator-based execution plan<br />
o   A parallel B+ tree algorithm for many-core processors</p>
<p>-        <strong>Cloud Computing and High Availability</strong>, it covers</p>
<p style="padding-left: 30px;">o   Live migration of database storage<br />
o   A highly available database built on a reliable virtual machine</p>
<p>-        <strong>Distributed System</strong> (2 sessions), it covers</p>
<p style="padding-left: 30px;">o   Selective partial replication of large-scale web databases<br />
o   A Paxos-based, highly available datastore<br />
o   DBMS-like indexing on an overlay network</p>
<p>-        <strong>MapReduce and Hadoop</strong>, it covers</p>
<p style="padding-left: 30px;">o   Adding a data co-location optimization to Hadoop for column-oriented storage applications<br />
o   Automatically optimizing Hadoop programs using code analysis</p>
<p>-        <strong>GPU-based Architectures and Column-Store Indexing</strong>, it covers</p>
<p style="padding-left: 30px;">o   Sparse matrix-vector multiplication on the GPU<br />
o   Transaction execution on the GPU<br />
o   List intersection and index compression using the GPU</p>
<p>So 13 of the 104 research papers at VLDB were systems papers. It was also striking that all the system related sessions were crowded, with active Q/A after the presentations, while the other sessions I happened to attend had relatively few attendees and were pretty quiet. The institutions publishing system related papers included CMU, IBM Research, Intel Research and Yahoo!.</p>
<h4><strong>=Notable Papers=</strong></h4>
<p>Here I only focus on system, search and transaction related papers.</p>
<p><strong>-        RemusDB: Transparent High Availability for Database Systems</strong><br />
Umar Farooq Minhas, Shriram Rajagopalan, Brendan Cully, Ashraf Aboulnaga, Kenneth Salem, Andrew Warfield</p>
<p>This work received the best paper award at VLDB 2011 and came from the University of Waterloo.</p>
<p>The paper proposes making a DBMS highly available by leveraging a VM high-availability technology called Remus, plus some DBMS-specific performance optimizations. It first explains why Remus can provide DBMS HA without breaking ACID properties, and then discusses 4 DBMS-specific optimizations (3 memory related, 1 network related).</p>
<p>To reduce the size of the checkpoint synced from the active node to the standby node, they put a diff of each page, rather than the whole page, into the checkpoint, since most modifications between consecutive checkpoints touch only part of a page.</p>
<p>To avoid checkpointing pages that can be read from disk, they also track disk read operations and put a small amount of metadata into the checkpoint; the standby server can use this metadata to reconstruct those memory pages.</p>
<p>They also implemented an interface that lets application developers explicitly mark pages that should not be replicated, but did not use it in this paper since it has a negative performance impact on DBMS software.</p>
<p>These optimizations are not really DBMS specific; they are pretty general, and other applications could also benefit from them, so I think they should be regarded as optimization work for Remus itself.</p>
<p>The remaining optimization is DBMS specific: it leverages transaction semantics to avoid Remus’s TCP-packet-level protection. With this optimization, the underlying VM only needs to protect transaction-control messages, such as the acknowledgements of ABORT and COMMIT messages from the client. This greatly reduces latency for the other messages, such as those that make up the transaction itself.</p>
<p>The ideas are simple and easy to understand, the results look very good, and the work was done in real-world code bases: Xen, MySQL and PostgreSQL. These are probably the reasons it was voted best paper, although the innovation and technical challenge are not that big in a systems person’s eyes.</p>
<p>There are some obvious drawbacks to this work:</p>
<p style="padding-left: 30px;">o   First, it only works with a VM, which adds overhead, especially for a DBMS, since such applications are very I/O sensitive. The paper didn’t mention the overhead of running a DBMS inside a VM.<br />
o   Second, it only compares performance with raw Remus, not with other HA technologies such as a MySQL HA cluster. Building an HA DBMS on a VM may not be the right approach compared with the alternatives.<br />
o   Third, Remus’s checkpointing has no knowledge of the transactions running inside the VM, so the standby server is consistent with the active server at the system level but not at the transaction level. That is, the latest state of the standby server may not be consistent in terms of ACID, so it can’t be used to serve read requests under a specific isolation level.</p>
<p><strong>-        PLP: Page Latch-free Shared everything OLTP</strong><br />
Ippokratis Pandis, Pınar Tozun, Ryan Johnson, Anastasia Ailamaki</p>
<p>This paper aims to improve the scalability of OLTP systems on many-core machines by combining existing logical (shared-everything) and physical (shared-nothing) partitioning technologies. The idea is pretty elegant, and the work looks very solid from both a systems and a DBMS perspective.</p>
<p>The meanings of “shared everything” and “shared nothing” in this paper are not the same as in distributed/parallel DBMS settings. Here they are techniques for eliminating the contention bottleneck of an OLTP system on a many-core platform: the former refers to assigning different ranges of the same shared table to different threads to avoid high-level locking among OLTP threads, and the latter refers to partitioning the underlying data pages of a table and assigning each partition to a different database instance.</p>
<p>PLP combines these two techniques through a new design called the Multi-Rooted B+ Tree:</p>
<p style="padding-left: 30px;">o   Each logical partition has its own B+ subtree as an index, which is similar to the shared-nothing design<br />
o   The underlying data pages are shared among all logical partitions, which is similar to the shared-everything design<br />
o   The transaction manager divides each transaction into a DAG of tasks; each task stays within a partition boundary and is assigned to the thread dedicated to that partition</p>
<p>Thus the new design retains the benefits of contention-free transaction threads and low-cost repartitioning/rebalancing, and eliminates the need for distributed transactions when a transaction crosses partitions.</p>
<p>However, the work is based on a research prototype called Shore-MT, built by WISC/EPFL. If it had been built on top of a popular open source DBMS such as MySQL or PostgreSQL, the work would be more convincing and would have a bigger real-world impact.</p>
<p><strong>-        Using Paxos to Build a Scalable, Consistent, and Highly Available Datastore</strong><br />
Jun Rao (LinkedIn), Eugene Shekita (IBM Research &#8211; Almaden), Sandeep Tata (IBM Research &#8211; Almaden)</p>
<p>A paper at NSDI 2011 talked about using Paxos to build a reliable data store, and here comes a similar story for DB. But this is not transactional storage, just key/value style structured storage. Also, the system architecture and the protocol are very similar to those of PacificA.</p>
<p>The system, called Spinnaker, is a replicated, range-partitioned, reliable structured store that provides put/get-like operations. The master is based on Apache ZooKeeper.</p>
<p>The replication protocol is essentially a combination of two-phase commit, majority consensus and group commit. It differs from PacificA in that it only requires acks from a majority of members before doing the real commit at the partition leader node. Follower recovery is simple and straightforward: the follower catches up with the leader’s state. For leader recovery, it uses ZooKeeper to choose the follower with the highest prepare number as the new leader.</p>
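<p>The majority-ack rule can be illustrated with a small sketch. This is not Spinnaker’s actual protocol (which also involves logging, sequence numbers and group commit); the class and helper names are hypothetical, and followers are modelled as plain callables that either acknowledge a proposal or raise ConnectionError.</p>

```python
class PartitionLeader:
    """Toy leader for one key range; commits once a strict majority has acked."""

    def __init__(self, followers):
        self.followers = followers      # callables: propose(key, value) -> bool
        self.store = {}

    def put(self, key, value) -> bool:
        cluster_size = len(self.followers) + 1
        acks = 1                        # the leader counts itself
        for propose in self.followers:
            try:
                if propose(key, value):
                    acks += 1
            except ConnectionError:
                pass                    # an unreachable follower simply does not ack
        if acks * 2 > cluster_size:     # strict majority reached
            self.store[key] = value     # the real commit at the leader
            return True
        return False


def make_follower(log, alive=True):
    """Build a hypothetical follower that appends proposals to `log`."""
    def propose(key, value):
        if not alive:
            raise ConnectionError("follower down")
        log.append((key, value))
        return True
    return propose
```

<p>With three replicas, a write still commits when one follower is down, but not when both are, which is the availability trade-off of waiting only for a majority.</p>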
<p>Another minor difference from PacificA is that it allows reads at follower nodes by providing so-called timeline consistency semantics.</p>
<p>Replication and consistency are always hot topics at DB conferences; both VLDB and SIGMOD have dedicated sessions.</p>
<p><strong>-        Column Oriented Storage Techniques for MapReduce</strong><br />
Avrilia Floratou (University of Wisconsin-Madison), Jignesh Patel (University of Wisconsin-Madison), Eugene Shekita (IBM Research &#8211; Almaden), Sandeep Tata (IBM Research &#8211; Almaden)</p>
<p>This paper presents several techniques for building a column-oriented structured storage and analysis engine on top of Hadoop.</p>
<p>The first is a storage enhancement:<br />
o   Each column is stored as one file in HDFS<br />
o   The files of multiple related columns are co-located by adding a new data placement policy to HDFS</p>
<p>The second technique is lazy record deserialization. The authors argue that modern analytical applications process increasingly complex data types, and deserializing them from a byte stream is pretty expensive, yet most applications only need to process a subset of the records. So they propose to:<br />
o   Deserialize only a small part of a complex object to determine whether the object needs to be processed<br />
o   Fully deserialize only the objects that need to be processed</p>
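<p>The lazy deserialization idea can be sketched as below. The record layout (a key, a tab character, then a JSON payload) and all names are assumptions made for this example; the paper works with Hadoop’s own serialization formats, not JSON.</p>

```python
import json


class LazyRecord:
    """Wraps a serialized record and parses the full payload only on demand."""

    def __init__(self, raw: bytes):
        # Assumed layout: key, one tab, JSON payload. Only the key is decoded eagerly.
        key, _, self._payload = raw.partition(b"\t")
        self.key = key.decode("utf-8")
        self._fields = None

    @property
    def fields(self) -> dict:
        if self._fields is None:        # first access triggers the expensive parse
            self._fields = json.loads(self._payload)
        return self._fields


def select(raw_records, wanted_prefix: str):
    """Yield fully parsed fields only for records whose key matches."""
    for raw in raw_records:
        record = LazyRecord(raw)
        if record.key.startswith(wanted_prefix):
            yield record.fields
```

<p>Records that fail the cheap key test are never passed to the JSON parser at all, which is where the saving comes from.</p>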
<p>Another optimization is record compression using specific techniques, for example a dictionary-based scheme for compressing text strings.</p>
<p>This work seems a fairly trivial, incremental improvement to the Hadoop system.</p>
<p><strong>-        CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop</strong><br />
Mohamed Eltabakh (IBM Research &#8211; Almaden), Yuanyuan Tian (IBM Research &#8211; Almaden), Fatma Ozcan (IBM Research &#8211; Almaden), Rainer Gemulla (Max-Planck-Institut für Informatik), Aljoscha Krettek (IBM Germany), John McPherson (IBM Research &#8211; Almaden)</p>
<p>This paper adds a new data placement policy to HDFS in Hadoop and uses it to speed up log processing tasks such as joins and sessionization queries.</p>
<p>The authors observed that many log processing jobs need to process data partitions from different HDFS files, so placing correlated partitions from different files together speeds up processing, since it eliminates much of the data shuffling and remote I/O.</p>
<p>Also, although it starts from a different angle, this paper covers only part of the work in the previous paper.</p>
<p><strong>-        Automatic Optimization for MapReduce Programs </strong><br />
Eaman Jahani (University of Michigan), Michael Cafarella (University of Michigan), Christopher Ré (University of Wisconsin-Madison)</p>
<p>This paper proposes using static code analysis to improve the performance of unmodified Hadoop jobs.</p>
<p>But the work only uses the analysis results for storage related optimizations:</p>
<p style="padding-left: 30px;">o   Using an index to pre-prune records that the map function would discard, via selection analysis<br />
o   Pruning fields that map/reduce never uses, via projection analysis<br />
o   Application-level compression</p>
<p>They didn’t do any plan-wide query optimization using the code analysis results, so there seems to be a lot of promising future work here.</p>
<p><strong>- Where in the World is My Data?</strong><br />
<em>Sudarshan Kadambi (Bloomberg), Jianjun Chen (Yahoo!), Brian Cooper  (Google), David Lomax (Yahoo!), Raghu Ramakrishnan (Yahoo!), Adam  Silberstein (Yahoo!), Erwin Tam (Yahoo!), Hector Garcia-Molina (Stanford  University)</em></p>
<p>This paper proposes replicating a structured data table at the record level rather than at the traditional table/partition level. The reasoning is that most popular websites hold global data, but accesses from different geographical sites focus on different parts of it. They call this idea selective replication.</p>
<p>In their design, record replicas are divided into 3 types:<br />
- Master, where write/update operations are executed and from which full replicas are asynchronously notified of each update/write<br />
- Full, where read operations can be executed<br />
- Stub, where only the primary key is stored and R/W operations are forwarded to an appropriate other replica</p>
<p>Given this setting, the paper focuses on the placement of the 3 types of replicas and optimizes it to save bandwidth (for forwarding and replication). They introduce static and dynamic placement policies and define a language for specifying replica placement constraints (for example, the total number of replicas, sites forced to hold full replicas, etc.).</p>
<p>The major difference between the static and dynamic placement policies is that the dynamic method can use historical access patterns to adjust the placement and achieve a lower bandwidth cost. To reduce the bookkeeping cost of storing and analyzing historical access data, the dynamic policy is simplified to: promote a stub to a full replica when a read arrives at it; demote a full replica to a stub when it has been notified of updates but not read for a period of time.</p>
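<p>The simplified dynamic policy amounts to a small state machine per record per site. The sketch below is my own illustration: the class name, the method names and the one-hour idle limit are assumptions, not values from the paper.</p>

```python
FULL, STUB = "full", "stub"


class ReplicaRecord:
    """State of one record at one non-master site."""

    def __init__(self, idle_limit: float = 3600.0):
        self.kind = STUB
        self.idle_limit = idle_limit    # assumed timeout in seconds
        self.last_read = None

    def on_read(self, now: float) -> bool:
        """Return True if the read had to be forwarded to another replica."""
        forwarded = self.kind == STUB
        self.kind = FULL                # promote: keep a full copy from now on
        self.last_read = now
        return forwarded

    def on_update(self, now: float) -> None:
        """Called when the master notifies this site of a write."""
        if self.kind == FULL and now - self.last_read > self.idle_limit:
            self.kind = STUB            # demote: updated but not read recently
```

<p>No access history is stored beyond the time of the last read, which is the point of the simplification.</p>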
<p>Their experiments show that, in a setting with 10% remote friends, bandwidth usage improves by 2x.</p>
<p>The value of a stub is that it saves one message round trip and avoids a hot spot at the master when the placement is not optimal.</p>
<p>Drawbacks:<br />
- Cross-row transactions are not supported<br />
- A primary key is required for each record</p>
<p><strong>-        Fast Set Intersection in Memory</strong><br />
Bolin Ding (University of Illinois at Urbana-Champaign), Arnd Christian König (Microsoft Research)</p>
<p>This paper describes a fast intersection algorithm for in-memory sets. The basic idea is to encode set elements in machine words and use bitwise AND to compute the intersection. The authors report about a 3x performance gain compared with inverted-index-based set intersection.</p>
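<p>The word-encoding idea alone can be sketched as below. This is not the paper's algorithm (which needs the preprocessing mentioned next); it only shows how a bitwise AND intersects two sets of small integers. The function names are mine, and a Python integer stands in for a machine word.</p>

```python
def encode(elements):
    """Pack a set of small non-negative integers into one integer bit vector."""
    word = 0
    for e in elements:
        word |= 1 << e
    return word


def decode(word):
    """Return the positions of the set bits, in ascending order."""
    out = []
    position = 0
    while word:
        if word & 1:
            out.append(position)
        word >>= 1
        position += 1
    return out


def intersect(a, b):
    # One AND replaces the element-by-element comparison.
    return decode(encode(a) & encode(b))
```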
<p>The main problems of this work are:<br />
o   it requires complicated and costly preprocessing, and dynamically updating the sets is not easy<br />
o   it is only applicable to in-memory big/small set intersection, and the result should be small</p>
<p><strong>-        Efficiently Compiling Efficient Query Plans for Modern Hardware</strong><br />
Thomas Neumann (Technische Universität München)</p>
<p>This paper describes a new DB query processing architecture that compiles query statements directly into machine code using LLVM. Current DB query processing is based on the iterator model, whose advantages are flexibility and pipelining. Its disadvantages are: 1, every iterator calls next() for every record, which results in a lot of function calls; 2, the next() calls are usually virtual function calls, which makes each call more expensive; 3, code locality is poor for a single query execution. So the author compiles the query plan directly into machine code, which eliminates these 3 drawbacks.</p>
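<p>To make the contrast concrete, here is a rough sketch of my own. The operator classes and the sample query (select price where price &gt; 10) are hypothetical, and a plain Python loop stands in for the machine code that the paper generates through LLVM, so it shows only the shape of the two approaches and not their performance.</p>

```python
class Scan:
    def __init__(self, rows):
        self._it = iter(rows)

    def next(self):
        return next(self._it, None)  # None marks the end of input


class Filter:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def next(self):
        while True:
            row = self.child.next()
            if row is None or self.predicate(row):
                return row


class Project:
    def __init__(self, child, column):
        self.child, self.column = child, column

    def next(self):
        row = self.child.next()
        return None if row is None else row[self.column]


def run_iterators(rows):
    # Iterator model: every result costs one next() call per operator.
    plan = Project(Filter(Scan(rows), lambda r: r["price"] > 10), "price")
    out = []
    while True:
        value = plan.next()
        if value is None:
            return out
        out.append(value)


def run_fused(rows):
    # "Compiled" shape: the whole pipeline collapsed into one tight loop.
    return [r["price"] for r in rows if r["price"] > 10]
```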
<p>The open questions for this approach are the compilation cost, whether the LLVM code optimizer is good enough for DB queries, and how to adopt existing query optimization techniques in this method.</p>
<p><strong>-        Efficient Parallel Lists Intersection and Index Compression Algorithms using Graphics Processing Units</strong><br />
Naiyong Ao, Fan Zhang, Di Wu, Douglas S. Stones, Gang Wang, Xiaoguang Liu, Jing Liu, Sheng Lin</p>
<p>This work comes from the Baidu-Nankai joint lab and aims to speed up index encoding/decoding and serving by leveraging GPUs. I am not familiar with GPU programming, so I skip the content here.</p>
<p>Industrial Sessions:</p>
<p>- <strong> Inspector Gadget: A Framework for Custom Monitoring and Debugging of Distributed Dataflows</strong></p>
<p style="padding-left: 30px;">o   Yahoo! reported a tool used to monitor and debug query-processing dataflows.</p>
<p>- <strong> Jaql: A Scripting Language for Large Scale Semistructured Data Analysis</strong></p>
<p style="padding-left: 30px;">o   IBM reported a scripting language on top of Hadoop called Jaql for large-scale semi-structured data analysis, which is used in its InfoSphere product.</p>
<p>-        <strong>Tenzing &#8211; A SQL Implementation on the MapReduce Framework</strong></p>
<p style="padding-left: 30px;">o   Google reported its Hive copycat called Tenzing, which extends standard SQL with support for advanced analysis. The paper also contains some MapReduce enhancements.<br />
o   This was probably the hottest paper at VLDB 2011, and many famous DBMS gurus crowded into the meeting room, probably because of the heated debate on Map/Reduce vs. parallel DBMS several years ago.<br />
o   Google adopted many techniques learned from parallel DBMS and Dryad to improve Map/Reduce in order to build a low-latency, SQL-compatible query analysis engine, which “partially answered the previous debate” (Google presenter’s words).</p>
<p><strong>-        Citrusleaf: A Real-Time NoSQL DB which Preserves ACID</strong></p>
<p style="padding-left: 30px;">o   A company called Citrusleaf reported its real-time NoSQL DB of the same name, which supports ACID and is widely used in some of the world’s largest real-time bidding networks.</p>
]]></content:encoded>
			<wfw:commentRss>http://csliu.com/2011/09/vldb2011-trip-report/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Baidu Tieba Architecture</title>
		<link>http://csliu.com/2011/07/baidu-tieba-architecture/</link>
		<comments>http://csliu.com/2011/07/baidu-tieba-architecture/#comments</comments>
		<pubDate>Sun, 31 Jul 2011 09:03:32 +0000</pubDate>
		<dc:creator>csliu</dc:creator>
				<category><![CDATA[Database]]></category>
		<category><![CDATA[WebArch]]></category>

		<guid isPermaLink="false">http://blog.csliu.com/?p=445</guid>
		<description><![CDATA[<p>An architect at Baidu, who is in charge of the technology behind the Tieba product, gave a brief introduction to the back-end technologies of this famous web application at the June event of <a href="http://www.infoq.com/cn/zones/baidu-salon/">Baidu Salon</a>.</p> <p>Here are some notes on this speech:</p> Part I &#8211; Application Scale <p>1. Not just simple plain forum, [...]]]></description>
			<content:encoded><![CDATA[<p>An architect at Baidu, who is in charge of the technology behind the Tieba product, gave a brief introduction to the back-end technologies of this famous web application at the June event of <a href="http://www.infoq.com/cn/zones/baidu-salon/">Baidu Salon</a>.</p>
<p>Here are some notes on this speech:</p>
<h4><strong>Part I &#8211; Application Scale</strong></h4>
<p>1. Not just a simple plain forum, but also photo/video/gaming<br />
2. Includes front end, storage, anti-spamming, searching and mining<br />
3. Numeric facts<br />
- Billions of topics<br />
- Tens of billions of posts<br />
- Tens of millions of posts for a single hot topic<br />
- Petabytes of video data<br />
- 100K+ QPS from client web browsers<br />
- 10K+ update messages forwarded per second (I doubt this number)<br />
- Hundreds of services</p>
<h4><strong>Part II &#8211; Backend Technology: Lightweight Framework </strong></h4>
<p>For 80% common situations</p>
<p>1 MySQL<br />
- prefer InnoDB over MyISAM, with some modifications (to the on-disk writing pattern, with 10x perf gain)<br />
- application optimization:<br />
* avoid joins (by breaking normalization?)<br />
* auxiliary index<br />
* data locality<br />
- single node numbers<br />
* Ks of QPS<br />
* 100Gs of Data<br />
- MySQL clustering<br />
* master/slave for write/read separation<br />
* home-brewed request dispatcher (for easier programming and load balancing)</p>
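<p>The talk gave no details on the dispatcher, so the following is only my guess at the basic shape of write/read separation: writes go to the master, reads rotate over the slaves. The <code>Dispatcher</code> class and the rule "SELECT means read" are assumptions, and replication lag is ignored.</p>

```python
import itertools


class Dispatcher:
    """Hypothetical request router for a master/slave MySQL group."""

    def __init__(self, master, slaves):
        self.master = master
        self._slaves = itertools.cycle(slaves)  # simple round-robin balancing

    def route(self, sql):
        words = sql.split()
        verb = words[0].upper() if words else ""
        if verb == "SELECT":
            return next(self._slaves)  # reads are spread over the slaves
        return self.master             # everything else goes to the master
```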
<p>2 Cache<br />
- hit ratio around 80%<br />
- 10k ~ 100k QPS<br />
- multiple granularity (page, picture, data item etc.)<br />
- challenge: cache updating, writing request pressure</p>
<p>3 Flash Disk<br />
- 5x &#8211; 10x perf gain without extra effort<br />
- huge improvement on random access<br />
- size limitation: 500G (SSD) vs 10T (HDD)</p>
<h4><strong>Part III &#8211; Backend Technology: Heavyweight Infrastructure </strong></h4>
<p>For 20% rare scenarios</p>
<p>1. Partitioning<br />
- Vertical, partitioned by application<br />
** topics and posts are separated<br />
** relationships (lists) and content are separated<br />
- Horizontal, partitioned by key</p>
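<p>A small sketch of how the two kinds of partitioning combine when locating a record. The talk only said "partitioned by key"; hashing the key, the store names and the shard counts below are my assumptions.</p>

```python
import zlib

# Hypothetical layout: one store per application entity (split by application),
# each divided into a fixed number of shards (split by key).
STORES = {"topic": 4, "post": 16}


def locate(store, key):
    """Return (store, shard index) for a record key."""
    shard_count = STORES[store]
    # crc32 is stable across processes, unlike Python's built-in hash().
    return store, zlib.crc32(key.encode("utf-8")) % shard_count
```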
<p>2. Message Queue<br />
- Reliable multicast communication system<br />
- Handling mutation requests (I guess)<br />
- Peak TPS: 100K+ (really?)<br />
- It can only solve the update reliability problem, but the speaker seemed to claim that it also solves the scalability problem</p>
<p>3. In house storage node<br />
- speeds up writes by transforming random writes into batch/append writes<br />
- memory patch: (background merge [mem + disk] in my understanding)<br />
- write-ahead logging for reliability<br />
- highly optimized for the application</p>
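<p>My understanding of these points can be sketched as below. It is not Baidu's implementation: the <code>StorageNode</code> class is hypothetical, and a Python list and dict stand in for the append-only log file and the on-disk data.</p>

```python
class StorageNode:
    """Toy model: append-only log + in-memory patch + background merge."""

    def __init__(self):
        self.log = []        # stands in for an append-only log file
        self.mem_patch = {}  # recent writes held in memory
        self.disk = {}       # stands in for the on-disk base data

    def put(self, key, value):
        self.log.append((key, value))  # write-ahead: log before applying
        self.mem_patch[key] = value    # no random disk write on this path

    def get(self, key):
        if key in self.mem_patch:      # the newest value wins
            return self.mem_patch[key]
        return self.disk.get(key)

    def merge(self):
        # Background step: fold the memory patch into the base data in one batch.
        self.disk.update(self.mem_patch)
        self.mem_patch.clear()
        self.log.clear()               # merged entries no longer need replay

    def recover(self):
        # After a crash, replay the log to rebuild the memory patch.
        self.mem_patch = dict(self.log)
```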
<p>4. In house distributed KV store<br />
- for video storage<br />
- replication (driven by MQ) for reliability<br />
- append only<br />
- Petabyte scale</p>
<h4><strong>Part IV &#8211; Backend Technology: Cluster Management</strong></h4>
<p>1. Most websites are basically built on an SOA architecture<br />
- 100+ standalone small services<br />
- service orchestration for single user request</p>
<p>2. Challenges in this architecture<br />
- service/data upgrading<br />
- failure handling<br />
- performance variation</p>
<p>3. Service Management<br />
- service metadata center management<br />
- service registration and notification<br />
- hide service cluster from application caller<br />
- auto failure handling and load balancing<br />
- (why service notification, rather than just trying and asking the registry on failure?)</p>
<h4><strong>Part V &#8211; Summary</strong></h4>
<p>It seems that there is nothing new in the presentation; all the related technologies are well known. But its value lies in the fact that it gives a high-level overview of how one of today&#8217;s famous web services is implemented, along with many numeric facts about the product.</p>
<h4><strong>[Reference]</strong></h4>
<p>0. <a href="http://www.infoq.com/cn/zones/baidu-salon/">Baidu Salon site</a><br />
1. <a href="http://www.infoq.com/cn/presentations/lh-baidu-tieba-architecture-practice">Speaker introduction and video</a><br />
2. <a href="http://www.infoq.com/pdfdownload.action?filename=presentations-ch%2Fbaidu-salon-20110625-lihan.pdf">Speech ppt </a></p>
]]></content:encoded>
			<wfw:commentRss>http://csliu.com/2011/07/baidu-tieba-architecture/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Google Plus – the Inside Out</title>
		<link>http://csliu.com/2011/07/google-plus-the-inside-out/</link>
		<comments>http://csliu.com/2011/07/google-plus-the-inside-out/#comments</comments>
		<pubDate>Wed, 13 Jul 2011 16:32:04 +0000</pubDate>
		<dc:creator>csliu</dc:creator>
				<category><![CDATA[Business]]></category>

		<guid isPermaLink="false">http://blog.csliu.com/?p=395</guid>
		<description><![CDATA[<p>Wired magazine recently published <a href="http://www.wired.com/epicenter/2011/06/inside-google-plus-social/all/">a great story on the origin and development of Google+</a>. The author has a lot of inside information about Google and the Google+ product, so the story contains much useful and insightful information about Google&#8217;s people-centric movement. I note some of my understanding and comments here:</p> 1. What&#8217;s Google+ ? <p>Basically, Google+ [...]]]></description>
			<content:encoded><![CDATA[<p>Wired magazine recently published <a href="http://www.wired.com/epicenter/2011/06/inside-google-plus-social/all/">a great story on the origin and development of Google+</a>. The author has a lot of inside information about Google and the Google+ product, so the story contains much useful and insightful information about Google&#8217;s people-centric movement. I note some of my understanding and comments here:</p>
<h3><strong>1. What&#8217;s Google+ ?</strong></h3>
<p>Basically, Google+ is Google&#8217;s social networking initiative to make this algorithm-centric giant more people-centric. Currently, it consists of the following major components:<br />
- <strong>Stream</strong>: Stuff shared by people you care about (circled by you in the G+ world), very similar to what Twitter/Facebook provide.<br />
- <strong>Spark</strong>: Stuff pushed to you by Google according to your specified interests.<br />
- <strong>Hangout</strong>: Web-based multi-user video chat service.<br />
- <strong>Circle</strong>: A multi-dimensional way to organize your online social networks.</p>
<p>But this is just a very basic introduction; Google+ is more than that, even in today&#8217;s service. More detail follows.</p>
<h3><strong>2. Why Google+ ?</strong></h3>
<p>Google+ is a big product (or product umbrella) mainly driven by <strong><a href="http://en.wikipedia.org/wiki/Vic_Gundotra">Vic Gundotra</a></strong>, SVP of Google&#8217;s social division and a former general manager at Microsoft in charge of the .NET/Live developer ecosystem.</p>
<p>The major driving forces of Google&#8217;s social efforts come from:</p>
<p>- <strong>Challenges</strong> from other pioneers such as Facebook. Facebook refused to open its content and connection data to Google while it got more and more popular. People at Google worry that Facebook may use this valuable user-contributed data to build an even better people-centric search engine that beats Google.</p>
<p>- <strong>Internet paradigm shift</strong>. The Internet and the applications on it are becoming more and more people-centric, which is not how things were when Google was founded:</p>
<blockquote><p>&#8220;The internet is nothing but<strong> software fabric that connects the interactions of human beings</strong>, every piece of software is going to be transformed by this primacy of people and this shift.&#8221; -Gundotra, SVP of Google social</p></blockquote>
<h3><strong>3. The History of Google&#8217;s Social Efforts</strong></h3>
<p>January 2004, Google launched its social networking service &#8211; <a href="http://www.orkut.com">Orkut</a>, developed as a spare-time project by <a href="http://en.wikipedia.org/wiki/Orkut_B%C3%BCy%C3%BCkk%C3%B6kten">Orkut Büyükkökten</a> while working at Google.</p>
<p>2007, Google started an initiative called OpenSocial to establish an open standard for social applications and platforms.</p>
<p>2009, a social-networking-based communication tool called Wave was introduced during Google I/O.</p>
<p>February 2010, a Twitter-like product called Buzz was integrated into Gmail.</p>
<p>None of them has been considered a successful product, but Google&#8217;s social networking efforts continue.</p>
<p>March 2010, only a month after the Buzz debacle, Google’s head of operations, Urs Hölzle, sent out an e-mail evoking Bill Gates’s <a href="http://www.scribd.com/doc/881657/The-Internet-Tidal-Wave">legendary 1995 Internet Tidal Wave</a> missive to Microsofties. Hölzle acknowledged that the fundamental way people use the internet had changed. He started some social-networking-related projects within Google, and his memo became known as the Urs-Quake.</p>
<p>May  2010, 50 of Google’s top people gathered together to discuss the  challenges faced by the search giant. Amit Singhal, one of the company’s  most respected search engineers, urged that Google dramatically expand  its focus to create a hub of personalization and social activity.</p>
<p>The Google leadership team adopted Singhal&#8217;s suggestion and code-named the project <strong>Emerald Sea</strong>. Gundotra made a pitch to lead the Emerald Sea project, and got the nod. Bradley Horowitz became his co-leader and collaborator.</p>
<div id="attachment_37312"><a href="http://www.wired.com/images_blogs/epicenter/2011/06/bradly_vic.jpg"><img class="aligncenter" title="bradly_vic" src="http://www.wired.com/images_blogs/epicenter/2011/06/bradly_vic.jpg" alt="" width="660" height="440" /></a>
<p style="text-align: center;">Google VP of product management Bradley Horowitz (L) and Vic Gundotra, Senior vice president of social for Google. (from [1])</p>
</div>
<h3><strong>4. The Birth of Google+ </strong></h3>
<p>- It got started just after the May meeting, and covered 18 current Google products, with almost 30 teams working in concert.</p>
<p>- It produced a working prototype 100 days after the May meeting (August 2010).</p>
<p>- It became ready for dogfood around October 2010.</p>
<p>- It got its first 50 users by email invitation, 600+ within about an hour, and 90% of Google employees within one day during dogfooding.</p>
<p>- The first round of dogfood feedback was not very positive, due to the lack of a tutorial and the complexity of the features &#8211; hard to comprehend and hard to use.</p>
<p>- It was refactored and re-conceptualized according to the feedback: some features were deferred to future releases, and some were split out as standalone features, such as the <a href="http://www.google.com/+1/button/">+1 button</a>.</p>
<p>- It rolled out the second round of dogfood to selected people within Google in Spring 2011 and got positive feedback.</p>
<p>- It started its field test on June 28, 2011, in which external users can try the product on an invite-only basis.</p>
<h3><strong>5. Feature Drill down and Insights</strong></h3>
<p><strong>Stream</strong> &#8211; ordered shared items from your social graph. It&#8217;s a pretty typical social networking feature, also provided by Twitter, Facebook and Weibo. But it has its own unique points:<br />
- It has no limit on the length of an item, while Twitter/Weibo limits it to 140 characters<br />
- It has a +1 button, and items can be commented on, with instant updates for online readers<br />
- It can be filtered by author group, which is very handy when you follow a large number of people</p>
<p><strong>Spark</strong> &#8211; streamed items from Google according to the topics you explicitly specified. It sounds like a normal search result page, but Google has adjusted the filtering and ranking policy to make it more suitable for sharing in the Google+ world. It favors fresh, socially popular and visual items discovered from the web.</p>
<p>Spark is the way Google tries to understand your unique interests and feed you related information. But it may also be a cover that hides the fact that Google is using privacy-related information from your Gmail content and search history to learn more about your interests.</p>
<p><strong>Circle/Sharing</strong> &#8211; offers a simple means of organizing one’s social network so that your sharing is micro-targeted: you organize your social network into various (possibly overlapping) circles and share items with specific circles. It may be the most important and also the most controversial feature in Google+.</p>
<p>Some people said that it helps them control who will see shared items, but others said that it makes sharing very complicated and the whole social network very hard to manage and understand.</p>
<p>In my personal experience, it&#8217;s an over-designed feature. I am forced to think about and select the target audience whenever I want to share something online, which breaks the famous UX design rule: DON&#8217;T MAKE ME THINK. Also, it&#8217;s very hard for a user to understand exactly who will ultimately see the item being shared.</p>
<p>Here is the exact Google+ share visibility workflow: (sorry, it&#8217;s in Chinese, but you can get the idea)</p>
<p style="text-align: center;"><img src="https://lh5.googleusercontent.com/-8S9y7e5rOLg/ThRxZ7h2URI/AAAAAAAADso/0jWy19ZWcJU/s640/222.jpg" alt="https://lh5.googleusercontent.com/-8S9y7e5rOLg/ThRxZ7h2URI/AAAAAAAADso/0jWy19ZWcJU/s640/222.jpg" /></p>
<p style="text-align: center;">(diagram from <a href="https://plus.google.com/112851096191264867748/posts/KuGvGPSFJ1U">william feng</a>)</p>
<p>How many people on this planet have enough patience to fully understand this logic and exercise it every time they want to share an interesting item?</p>
<p>The idea of circles and multiple social networks is said to come from the following research:</p>
<div style="width:477px;margin-left:auto;margin-right:auto;" id="__ss_4656436"> <strong style="display:block;margin:12px 0 4px"><a href="http://www.slideshare.net/padday/the-real-life-social-network-v2" title="The Real Life Social Network v2" target="_blank">The Real Life Social Network v2</a></strong> <iframe src="http://www.slideshare.net/slideshow/embed_code/4656436" width="477" height="510" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
<div style="padding:5px 0 12px"> View more <a href="http://www.slideshare.net/" target="_blank">documents</a> from <a href="http://www.slideshare.net/padday" target="_blank">Paul Adams</a> </div>
</p></div>
<p>Google claims that it created the concept of circles because it behaves exactly the same way as our real social experience. Let&#8217;s assume that it does behave exactly like real social activity, but is it better to behave the same as reality? I don&#8217;t think so. We spend more and more time on online social activities because they are different (in a positive way) from the boring real society. For me, I use various online social services because they make it more convenient to keep in touch with real friends, and because they are more open and make it easier to get to know more friends, especially those who aren&#8217;t available in real life. If the online society is the same as the real one, what is the attraction of the online social service? I feel Google&#8217;s circle concept makes online social life more enclosed and more complicated to understand and master.</p>
<p>Some other critics said &#8220;SNS just do what virtual world should do, let some other stuff happen in real world&#8221; and &#8220;in real life, the circle is not chosen when you want to convey some message, rather, you choose what to say when you are in different situation and different circle&#8221;.</p>
<p>I do admit that there are some situations where I don&#8217;t want my message to be visible to someone in my social network. But that is better fulfilled by a feature for selecting the users you want to hide your message/status from, i.e., <strong>you need to do subtraction rather than addition</strong>. Here, the subtraction operation is easier to understand and involves less thinking.</p>
<p>But circles are a good idea for filtering the stream, especially when you follow many people who update at different rates.</p>
<h3><strong>6. Misc</strong></h3>
<p>- “There  are only a few emotions that can effect change at a large   organization,” he (Gundotra) explains. “<strong>One is greed and another powerful one is   fear</strong>.” Outright greed is gauche in the Googleplex, so Gundotra prepared a   slide deck that mocked up challenges from Google’s competitors   (notably, Facebook), illustrating how each company could turn Google   upside down.</p>
<p>- Emerald Sea has been the rare initiative in  Google where the company was  not breaking ground but defensively  <strong>responding to a competitor’s  success</strong>. (One engineer has described this  process as “chasing  taillights,” noting that me-too-ism has never been a  strength for  Google.) It’s also, claims Gundotra, the most extensive  companywide  initiative in Google’s history.</p>
<p>- “We put the product to [dog food] before it was fully baked, before we hardened the system and polished it and knew what we were doing,” says Horowitz. “We had no getting-started screen, no intro video. It was <strong>hard for people to get their hands around</strong> what it is and how to begin interacting with it. It was as if Facebook had been in stealth mode for seven years and then launched in its entirety at once today — it would have been an overwhelming, hard-to-comprehend, hard-to-understand system. The feedback we got was: <strong>Simplify</strong>.”</p>
<p>- No  one expects an instant success. But even if this week’s launch  evokes  snark or yawns, Google will keep at it. Google+ is not a product  like  Buzz or Wave where the company’s leaders can chalk off a failure to   laudable ambition and then move on. “We’re in this for the long run,”   says Ben-Yair. “This isn’t like an experiment. <strong>We’re betting on this</strong>, so   if obstacles arise, we’ll adapt.”</p>
<p>- Because of the pressure, the stakes and the scale, Gundotra insisted that Emerald Sea should be an exception to Google’s usual <strong>consensus-based management style</strong>.</p>
<p>- “This is a <strong>top-down mandate where a clear vision is set out</strong>, and then  the mode of moving forward is that you answer to Vic,” Rick Klau told me  last year. “If Vic says ‘That looks good,’ then it looks good.”</p>
<h3><strong>[Reference]</strong></h3>
<p>[1] <a href="http://www.wired.com/epicenter/2011/06/inside-google-plus-social/all/1">Inside Google+ — How the Search Giant Plans to Go Social</a><br />
[2] <a href="http://www.slideshare.net/padday/the-real-life-social-network-v2">The Real Life Social Network V2</a><br />
[3] <a href= "http://en.wikipedia.org/wiki/Google%2B">Google Plus Wiki</a><br />
[4] <a href="http://www.marketingprofessor.com/social-marketing/40-google-plus-tips-for-newbies/">40 Google Plus Tips</a><br />
[5] <a href="http://www.findallanswers.com/google-plus-tutorial-how-to-make-your-stay-pleasant-and-useful/">Google Plus Tutorial</a></p>
]]></content:encoded>
			<wfw:commentRss>http://csliu.com/2011/07/google-plus-the-inside-out/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Cross the Wall using SSH</title>
		<link>http://csliu.com/2011/07/cross-the-wall-using-ssh/</link>
		<comments>http://csliu.com/2011/07/cross-the-wall-using-ssh/#comments</comments>
		<pubDate>Sun, 03 Jul 2011 16:30:33 +0000</pubDate>
		<dc:creator>csliu</dc:creator>
				<category><![CDATA[CsNotes]]></category>

		<guid isPermaLink="false">http://csliu.com/?p=115</guid>
		<description><![CDATA[<p>There is a well-known wall called the GFW (功夫网) on the Chinese Internet, and there are many ways to cross it. Private proxying over SSH may be the most reliable, available and cheapest way to do so.</p> <p>&#160;</p> <p>1. How to get an SSH service?</p> <p>- Register for a web hosting or virtual private [...]]]></description>
			<content:encoded><![CDATA[<p>There is a well-known wall called the GFW (功夫网) on the Chinese Internet, and there are many ways to cross it. Private proxying over SSH may be the most reliable, available and cheapest way to do so.</p>
<p>&nbsp;</p>
<p>1. How to get an SSH service?</p>
<p>- Register for a web hosting or virtual private server service outside mainland China</p>
<p>- For web hosting, I recommend <a href="http://bluehost.com">bluehost</a>; for a VPS, I recommend <a href="http://linode.com">linode</a></p>
<p>&nbsp;</p>
<p>2. How to proxy over SSH</p>
<p>- Create a SOCKS5 proxy over SSH using <a href="http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html">plink</a> or <a href="http://www.bitvise.com/tunnelier">Bitvise Tunnelier </a></p>
<p>- Configure your browser wisely. I recommend: <a href="http://www.mozilla.com/">Firefox</a> + <a href="https://addons.mozilla.org/en-US/firefox/addon/autoproxy/ ">AutoProxy</a></p>
<p>&nbsp;</p>
<p>Now take a deep breath and feel the fresh air of freedom!</p>
]]></content:encoded>
			<wfw:commentRss>http://csliu.com/2011/07/cross-the-wall-using-ssh/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Things Learned from “from big idea to thriving business in 8 short years”</title>
		<link>http://csliu.com/2011/06/things-learned-from-from-big-idea-to-thriving-business-in-8-short-years/</link>
		<comments>http://csliu.com/2011/06/things-learned-from-from-big-idea-to-thriving-business-in-8-short-years/#comments</comments>
		<pubDate>Tue, 14 Jun 2011 16:37:43 +0000</pubDate>
		<dc:creator>csliu</dc:creator>
				<category><![CDATA[Business]]></category>
		<category><![CDATA[Management]]></category>

		<guid isPermaLink="false">http://blog.csliu.com/?p=430</guid>
		<description><![CDATA[<p>Just read a great story from a programmer on building his own business little by little: <a href="http://blog.traysoft.com/2011/04/my_startup_story/">My Startup Story: from Big idea to Thriving Business in 8 Short Years</a>. Its greatness lies not in how big his business is, but in how his business gets bigger and bigger and in how he deals with [...]]]></description>
			<content:encoded><![CDATA[<p>Just read a great story from a programmer on building his own business little by little: <a href="http://blog.traysoft.com/2011/04/my_startup_story/">My Startup Story: from Big idea to Thriving Business in 8 Short Years</a>. Its greatness lies not in how big his business is, but in how his business gets bigger and bigger and in how he deals with the various problems encountered on the long journey. Here are some of the lessons I learned from reading this story:</p>
<p>1. dream big but execute step by step;</p>
<p>2. make plans but adjust them agilely;</p>
<p>3. solve real problems from users;</p>
<p>4. care about user feedback;</p>
<p>5. have passion for technology;</p>
<p>6. improve continuously;</p>
<p>7. have business sense;</p>
<p>8. listen to others;</p>
<p>9. think beyond today and be aware of future crises;</p>
<p>10. forget about yesterday&#8217;s success and dare to start again;</p>
<p>11. reflect on existing products and dig deeper;</p>
<p>12. build an ecosystem</p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://csliu.com/2011/06/things-learned-from-from-big-idea-to-thriving-business-in-8-short-years/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Memory Issues on Multicore Platform</title>
		<link>http://csliu.com/2011/02/memory-issues-on-multicore-platform/</link>
		<comments>http://csliu.com/2011/02/memory-issues-on-multicore-platform/#comments</comments>
		<pubDate>Sat, 26 Feb 2011 11:18:00 +0000</pubDate>
		<dc:creator>csliu</dc:creator>
				<category><![CDATA[Engineering]]></category>
		<category><![CDATA[System]]></category>

		<guid isPermaLink="false">http://csliu.com/?p=111</guid>
		<description><![CDATA[<p>On a multi-core platform, pure computing is cheap since there are many processing units, and memory capacity may not be a problem either since it keeps getting larger. But memory bandwidth remains the bottleneck, because memory is accessed over a bus that is shared by all CPU cores. So efficient memory management is critical for [...]]]></description>
			<content:encoded><![CDATA[<p>On a multi-core platform, pure computing is cheap since there are many processing units, and memory capacity may not be a problem either since it keeps getting larger. But <strong>memory bandwidth</strong> remains the bottleneck, because memory is accessed over a bus that is shared by all CPU cores. So efficient memory management is critical for a scalable application on a multicore CPU.</p>
<p>In this article I will point out some memory related problems regarding multicore architecture and also some solutions.</p>
<p><span style="font-size: large;"><strong>Part I &#8211; Memory Contention</strong></span></p>
<p>Memory contention means that different cores share a common data region (in main memory and cache) that needs to be synchronized among them. Synchronizing data among different cores carries a big performance penalty because of bus traffic contention, locking cost and cache misses. There are two strategies for dealing with this problem:</p>
<p><strong>1. Don&#8217;t Share Writable State Among Cores</strong></p>
<p>To minimize memory bus traffic, you should minimize core interactions by <strong>minimizing shared locations/data</strong>, even if the shared data is protected not by a lock but by hardware-level atomic instructions such as <strong>InterlockedExchangeAdd64</strong> on the Win32 platform.</p>
<p>The  patterns that tend to reduce lock contention also tend  to reduce  memory traffic, because it is the shared writable state that requires   locks and generates contention. In practice, letting each thread work on  its own local copy of the data and merging  the data after all threads  are done can be a very effective strategy.</p>
<p>Let&#8217;s look at <a href="http://code.google.com/p/code4cs/source/browse/trunk/Tools/MemPerfor/ParaSum.cxx">two parallel versions of a sum calculation program</a> on an eight-core computer. One version uses a shared global variable protected by <strong>InterlockedExchangeAdd64()</strong> to track all intermediate results among all threads. The other version gives each thread a private partial sum variable that&#8217;s not shared at all, and the final sum is computed as the sum of all these partial sums.</p>
<p>From the console output we can clearly see that the private partial sum solution is about 20x faster than the other one.</p>
<div style="font-family: inherit;">Use Global &#8211; Total Sum is:49999995000000, used ticket:<strong>904</strong>.</div>
<p><span style="font-family: inherit;">Use Local &#8211; Total Sum is:49999995000000, used ticket:</span><strong style="font-family: inherit;">47</strong><span style="font-family: inherit;">. </span></p>
<p>So,  even if we just share one variable protected by hardware atomic  instructions, the performance penalty could be very significant.</p>
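<p>As a rough illustration of the two versions compared above, here is a minimal portable sketch. It is not the linked ParaSum.cxx program: it assumes C++11, uses std::atomic in place of InterlockedExchangeAdd64, and the function names are made up for this example.</p>

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Version 1: every thread adds into one shared atomic counter.
// Each fetch_add makes the cache line holding `total` move between cores.
std::int64_t sum_with_shared_atomic(std::int64_t n, unsigned num_threads) {
    assert(num_threads > 0);
    std::atomic<std::int64_t> total{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&total, n, num_threads, t] {
            for (std::int64_t i = t; i < n; i += num_threads) {
                total.fetch_add(i, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();
    return total.load();
}

// Version 2: every thread accumulates into a private local variable and
// publishes its partial sum once; the caller merges the partial sums.
std::int64_t sum_with_private_partials(std::int64_t n, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<std::int64_t> partial(num_threads, 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&partial, n, num_threads, t] {
            std::int64_t local = 0;  // private to this thread
            for (std::int64_t i = t; i < n; i += num_threads) local += i;
            partial[t] = local;      // single write at the end
        });
    }
    for (auto& w : workers) w.join();
    std::int64_t total = 0;
    for (std::int64_t p : partial) total += p;
    return total;
}
```

<p>Both functions sum the integers 0 .. n-1, so with n = 10000000 they return 49999995000000, the same total as in the console output above. How large the speed difference is will depend on the machine and compiler.</p>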
<p>The  general rule for efficient execution on a single core is to pack  data  tightly, so that it has as small a footprint as possible. But on a   multi-core processor, packing shared data can lead to a severe penalty   from false sharing. Generally, the solution is to pack data tightly,   give each thread its own private copy to work on, and merge results   afterwards.</p>
<p><strong>2. Avoid False Sharing introduced by Core Cache</strong></p>
<p>Good  performance depends on processors fetching most of their data from   cache instead of main memory. For sequential programs, modern caches   generally work well without too much thought, though a little tuning   helps. The smallest unit of memory that two processors interchange is a  cache line or cache sector.</p>
<p>Even if we follow strategy 1 and let each thread access only its private data/state, threads on different cores may still end up writing to distinct variables that sit in the same cache line. This is called &#8220;<strong>false sharing</strong>&#8221;. Avoiding false sharing may require aligning variables or objects in memory on cache line boundaries.</p>
<p>Let&#8217;s use <a href="http://code.google.com/p/code4cs/source/browse/trunk/Tools/MemPerfor/CachePerf.cpp">a parallel number increaser</a> to see the performance penalty of false sharing. In the first version, each thread modifies some thread-specific number variables that are laid out next to each other (so they are packed into the same cache line). In the second version, those variables are placed at non-contiguous locations.</p>
<p>The measured times are:<br />
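<p>A minimal sketch of the alignment idea, assuming C++11 and a 64-byte cache line (a common value on x86, but an assumption here). The type and function names are hypothetical and this is not the linked CachePerf.cpp program.</p>

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLineSize = 64;  // assumed, not queried from the CPU
constexpr std::size_t kNumThreads = 4;

// Packed: neighbouring counters in an array land in the same cache line.
struct PackedCounter {
    std::atomic<long> value{0};
};

// Padded: alignas rounds the size up, so each counter owns a full cache line.
struct alignas(kCacheLineSize) PaddedCounter {
    std::atomic<long> value{0};
};

static_assert(sizeof(PaddedCounter) == kCacheLineSize,
              "one padded counter per cache line");

// Each thread increments only its own counter; no data is logically shared.
template <typename Counter>
long increment_in_parallel(long iterations) {
    std::array<Counter, kNumThreads> counters;
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < kNumThreads; ++t) {
        workers.emplace_back([&counters, t, iterations] {
            for (long i = 0; i < iterations; ++i) {
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();
    long total = 0;
    for (auto& c : counters) total += c.value.load();
    return total;
}
```

<p>Calling increment_in_parallel&lt;PackedCounter&gt;(n) and increment_in_parallel&lt;PaddedCounter&gt;(n) performs the same work and returns the same total; only the memory layout differs, which is what a timing comparison would be measuring.</p>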
Total Time:<strong>2012</strong> (for first version)<br />
Total Time:<strong>468</strong> (for second version)</p>
<p>We can see that false sharing introduced a performance penalty of more than 4x (2012 vs. 468). The fix is to align or pad per-thread variables or objects to cache line boundaries, so that each core accesses a private cache line that is not shared with others.</p>
<p><span style="font-size: large;"><strong>Part II &#8211; Heap Contention</strong></span></p>
<p>Most developers manage memory using the standard C library malloc/free or the standard C++ library new/delete. Some use OS APIs directly, for example <strong>HeapAlloc</strong>()/<strong>HeapFree</strong>() on the Windows platform.</p>
<p>The C/C++ standard memory management routines are implemented using platform-specific memory management APIs, usually based on the concept of a <strong>heap</strong>. These library routines (whether in the single-threaded or the multi-threaded version) allocate and free memory on a single heap, which is usually called the CRT heap. It&#8217;s a global resource that is shared and contended among the threads within a process.</p>
<p>This <strong>heap contention</strong> is one of the bottlenecks of multi-threaded applications that are memory intensive. The solution is to use a thread-local/private heap for memory management, which eliminates the resource contention. On the Windows platform, this means that you need to create a dedicated heap using <strong>HeapCreate</strong>() for each thread and pass the returned heap handle to the <strong>HeapAlloc</strong>()/<strong>HeapFree</strong>() functions.</p>
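<p>The same idea can be sketched portably. The following is not the Windows heap API: it is a hypothetical bump-pointer arena, one per thread via thread_local, assuming C++11. All names (ThreadArena, local_arena, the 1 MiB capacity) are assumptions made for illustration.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical bump-pointer arena. It is only ever used by the thread that
// owns it, so allocate() needs no lock.
class ThreadArena {
public:
    explicit ThreadArena(std::size_t capacity) : buffer_(capacity), used_(0) {}

    // Returns `size` bytes aligned to `align` (a power of two),
    // or nullptr when the arena is exhausted.
    void* allocate(std::size_t size,
                   std::size_t align = alignof(std::max_align_t)) {
        assert(align != 0 && (align & (align - 1)) == 0);
        std::uintptr_t base = reinterpret_cast<std::uintptr_t>(buffer_.data());
        std::uintptr_t current = base + used_;
        std::uintptr_t mask = static_cast<std::uintptr_t>(align) - 1;
        std::uintptr_t aligned = (current + mask) & ~mask;
        std::size_t new_used = static_cast<std::size_t>(aligned - base) + size;
        if (new_used > buffer_.size()) return nullptr;
        used_ = new_used;
        return reinterpret_cast<void*>(aligned);
    }

    void reset() { used_ = 0; }  // releases every block at once
    std::size_t used() const { return used_; }

private:
    std::vector<unsigned char> buffer_;
    std::size_t used_;
};

// Each thread lazily gets its own arena on first use.
ThreadArena& local_arena() {
    thread_local ThreadArena arena(1 << 20);
    return arena;
}

// Returns true if a second thread sees a different arena than the caller.
bool other_thread_gets_own_arena() {
    ThreadArena* mine = &local_arena();
    bool differs = false;
    std::thread worker([mine, &differs] { differs = (&local_arena() != mine); });
    worker.join();
    return differs;
}
```

<p>The trade-off of this sketch is that individual blocks cannot be freed, only the whole arena. A per-thread heap created with HeapCreate() does not have that limitation.</p>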
<p>Let&#8217;s see this <a href="http://code.google.com/p/code4cs/source/browse/trunk/Tools/MemPerfor/WinHeap.cxx">Global Heap Vs Local Heap example on Windows platform</a></p>
<p>On an 8-core system, the perf test results using 8 threads are:<br />
8 core time:<strong>59282</strong>, use global heap?<strong> true</strong>.<br />
8 core time:<strong>20112</strong>, use global heap? <strong>false</strong>.</p>
<p>Using private heaps gives around a 3x performance gain.</p>
<p>NOTE:<br />
- On the Windows platform, the <strong>HEAP_NO_SERIALIZE</strong> flag can be set when creating a heap. The heap functions then skip their internal locking, so there is no synchronization cost, and the caller must guarantee that only one thread uses the heap at a time. This looks ideal for a thread-private heap, but it turns out that a heap created with this flag is very slow on Vista and later operating systems.<br />
- The reason is that in Vista, Microsoft refactored the heap manager code and removed some extra data structures and code paths that were no longer part of the common case for handling heap API calls.<br />
- HEAP_NO_SERIALIZE and some debug scenarios disable the Low-fragmentation Heap (LFH) feature, which is now the de facto default policy for heaps and is therefore highly optimized.</p>
<p><span style="font-size: large;"><strong>Part III &#8211; Dynamic Creation/Free of C++ Object</strong></span></p>
<p><strong>Operator new/delete</strong> are <strong>functions</strong>. They are the C++ counterparts of malloc/free and are responsible only for allocating and releasing memory. They come in a global version, <strong>::operator new</strong>, and a class-level version (a static member), <strong>class-name::operator new</strong>.</p>
<p>The <strong>new/delete operators</strong>, on the other hand, handle object construction and destruction in addition to memory management. They are language operators just like +, -, *, / and others. A new/delete operator calls the class-specific operator new/delete if the requested class defines such functions, and the global operator new/delete otherwise.</p>
<p>In order to fully parallelize an application that uses STL containers, you might need to write your own allocator that leverages thread-private heaps or memory pools. That way, your business logic stays the same as in the single-core version while the contention bottleneck is eliminated.</p>
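<p>The split between the two can be made visible with a small sketch. The class Tracked and its counters are hypothetical and exist only for this illustration. The class-level operator new/delete count memory blocks, while the constructor and destructor count objects.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

struct Tracked {
    static int live_allocations;  // blocks handed out by operator new
    static int constructed;       // objects currently constructed
    int payload;

    Tracked() : payload(42) { ++constructed; }
    ~Tracked() { --constructed; }

    // Class-level operator new: obtains memory only, constructs nothing.
    static void* operator new(std::size_t size) {
        void* p = std::malloc(size);
        if (p == nullptr) throw std::bad_alloc();
        ++live_allocations;
        return p;
    }

    // Class-level operator delete: releases memory only.
    static void operator delete(void* p) noexcept {
        if (p != nullptr) {
            --live_allocations;
            std::free(p);
        }
    }
};

int Tracked::live_allocations = 0;
int Tracked::constructed = 0;
```

<p>Writing <b>new Tracked</b> calls Tracked::operator new and then the constructor, so both counters go up. Calling <b>Tracked::operator new(sizeof(Tracked))</b> directly only raises live_allocations, because no constructor runs.</p>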
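<p>For orientation, a minimal C++11-style allocator might look like the sketch below. MallocAllocator and live_blocks are hypothetical names. The sketch simply forwards to malloc/free and keeps a per-thread block count; in a real application the two marked calls are where a thread-private heap or memory pool would be plugged in.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Per-thread count of blocks currently allocated through MallocAllocator.
inline long& live_blocks() {
    thread_local long count = 0;
    return count;
}

template <typename T>
struct MallocAllocator {
    using value_type = T;

    MallocAllocator() noexcept = default;
    template <typename U>
    MallocAllocator(const MallocAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        // Replace std::malloc with a thread-private heap or pool here.
        void* p = std::malloc(n != 0 ? n * sizeof(T) : 1);
        if (p == nullptr) throw std::bad_alloc();
        ++live_blocks();
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) noexcept {
        // Replace std::free with the matching release call here.
        --live_blocks();
        std::free(p);
    }
};

// All instances are interchangeable, so they always compare equal.
template <typename T, typename U>
bool operator==(const MallocAllocator<T>&, const MallocAllocator<U>&) noexcept {
    return true;
}
template <typename T, typename U>
bool operator!=(const MallocAllocator<T>&, const MallocAllocator<U>&) noexcept {
    return false;
}
```

<p>A container opts in through its allocator template argument, for example std::vector&lt;int, MallocAllocator&lt;int&gt; &gt;, and the code that uses the container stays unchanged.</p>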
<p>Here is the example on <a href="http://code.google.com/p/code4cs/source/browse/trunk/Tools/MemPerfor/CppAlloc.cpp">writing your own operator new/delete and allocator</a>.</p>
<p><strong>[Reference]</strong></p>
<p>Nehalem Architecture<br />
<a href="http://arstechnica.com/hardware/news/2008/04/what-you-need-to-know-about-nehalem.ars">http://arstechnica.com/hardware/news/2008/04/what-you-need-to-know-about-nehalem.ars</a><br />
<a href="http://rolfed.com/nehalem/nehalemPaper.pdf">http://rolfed.com/nehalem/nehalemPaper.pdf</a></p>
<p>Cache Organization and Memory Management of the Intel Nehalem Computer Architecture<br />
<a href="http://rolfed.com/nehalem/nehalemPaper.pdf">http://rolfed.com/nehalem/nehalemPaper.pdf</a></p>
<p>Cross-Platform Get Cache Line Size<br />
<a href="http://strupat.ca/2010/10/cross-platform-function-to-get-the-line-size-of-your-cache/">http://strupat.ca/2010/10/cross-platform-function-to-get-the-line-size-of-your-cache/</a></p>
<p>Understanding and Avoiding Memory Issues with Multi-core Processors<br />
<a href="http://www.drdobbs.com/high-performance-computing/212400410">http://www.drdobbs.com/high-performance-computing/212400410</a></p>
<p>Thread/Data placement for better/consistent performance on Multi-Core/NUMA Architecture<br />
<a href="http://www.renci.org/wp-content/pub/techreports/TR-08-07.pdf">http://www.renci.org/wp-content/pub/techreports/TR-08-07.pdf</a></p>
<p>Parallel Memory Management(Allocate/Free) Intensive Applications on Multi-core system<br />
English Version &#8211; <a href="http://www.codeproject.com/KB/cpp/rtl_scaling.aspx">http://www.codeproject.com/KB/cpp/rtl_scaling.aspx</a><br />
Chinese Version &#8211; <a href="http://blog.csdn.net/arau_sh/archive/2010/02/22/5317919.aspx">http://blog.csdn.net/arau_sh/archive/2010/02/22/5317919.aspx</a></p>
<p>Intel Guide for Developing Multithreaded Applications<br />
<a href="http://software.intel.com/en-us/articles/intel-guide-for-developing-multithreaded-applications/">http://software.intel.com/en-us/articles/intel-guide-for-developing-multithreaded-applications/</a></p>
<p>Windows Heap Management/Performance<br />
<a href="http://stackoverflow.com/questions/1983563/reason-for-100x-slowdown-with-heap-memory-functions-using-heap-no-serialize-on-v">http://stackoverflow.com/questions/1983563/reason-for-100x-slowdown-with-heap-memory-functions-using-heap-no-serialize-on-v</a><br />
<a href="http://www.blackhat.com/presentations/bh-usa-06/BH-US-06-Marinescu.pdf">http://www.blackhat.com/presentations/bh-usa-06/BH-US-06-Marinescu.pdf</a><br />
<a href="http://www.codeproject.com/KB/winsdk/HeapPerf.aspx">http://www.codeproject.com/KB/winsdk/HeapPerf.aspx</a><br />
<a href="http://blogs.msdn.com/b/oldnewthing/archive/2010/04/29/10004218.aspx">http://blogs.msdn.com/b/oldnewthing/archive/2010/04/29/10004218.aspx</a></p>
<p>Memory Optimization for the entire C++ program<br />
<a href="http://www.cantrip.org/wave12.html">http://www.cantrip.org/wave12.html</a></p>
<p>C++ Dynamic Memory Management Techniques<br />
<a href="http://www.cs.wustl.edu/%7Eschmidt/PDF/C++-mem-mgnt4.pdf">http://www.cs.wustl.edu/~schmidt/PDF/C++-mem-mgnt4.pdf</a></p>
<p>Understanding Operator New and Operator Delete<br />
<a href="http://www.codeproject.com/KB/cpp/Memory_Management.aspx">http://www.codeproject.com/KB/cpp/Memory_Management.aspx</a></p>
<p>C++ Standard Allocator &#8211; Introduction and Implementation<br />
<a href="http://www.codeproject.com/KB/cpp/allocator.aspx">http://www.codeproject.com/KB/cpp/allocator.aspx</a><br />
<a href="http://www.codeguru.com/cpp/cpp/cpp_mfc/stl/article.php/c4079">http://www.codeguru.com/cpp/cpp/cpp_mfc/stl/article.php/c4079</a></p>
<p>Improve Performance by Allocator using Pooled Memory<br />
<a href="http://www.drdobbs.com/cpp/184406243">http://www.drdobbs.com/cpp/184406243</a></p>
<p>Improving STL Allocators<br />
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2045.html">http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2045.html</a></p>
<p>Anatomy of the Linux slab allocator<br />
<a href="http://www.ibm.com/developerworks/linux/library/l-linux-slab-allocator/">http://www.ibm.com/developerworks/linux/library/l-linux-slab-allocator/</a></p>
]]></content:encoded>
			<wfw:commentRss>http://csliu.com/2011/02/memory-issues-on-multicore-platform/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tips for Smart Pointers in C++</title>
		<link>http://csliu.com/2011/01/tips-for-smart-pointers-in-c/</link>
		<comments>http://csliu.com/2011/01/tips-for-smart-pointers-in-c/#comments</comments>
		<pubDate>Sun, 16 Jan 2011 13:17:00 +0000</pubDate>
		<dc:creator>csliu</dc:creator>
				<category><![CDATA[Engineering]]></category>

		<guid isPermaLink="false">http://csliu.com/?p=110</guid>
		<description><![CDATA[<p>Part I &#8211; Brief Summary for Various Smart Pointers </p> <p>1. auto_ptr<br />- RAII and transfer-of-ownership semantics based, but no shared-ownership<br />- Managed heap object will be owned by one and only one<br />- Assignment/Copy Construction will transfer ownership<br />- Can be compiled with STL containers, but wrong semantic</p> <p>2. scoped_ptr<br />- RAII semantic based, [...]]]></description>
			<content:encoded><![CDATA[<p><b><span style="font-size: large;">Part I &#8211; Brief Summary for Various Smart Pointers </span></b></p>
<p><b>1. auto_ptr</b><br />- Based on RAII and transfer-of-ownership semantics, with no shared ownership<br />- The managed heap object is owned by one and only one pointer<br />- Assignment/copy construction transfers ownership<br />- Compiles with STL containers, but has the wrong semantics for them</p>
<p><b>2. scoped_ptr</b><br />- RAII semantic based, but no shared-ownership, nor transfer-of-ownership semantics<br />- Managed heap object will be owned by one and only one pointer<br />- Assignment/Copy Construction are forbidden<br />- Can&#8217;t be compiled with STL containers </p>
<p><b>3. shared_ptr</b><br />- Reference count based<br />- The managed heap object can be owned by multiple smart pointers<br />- Assignment/copy construction adds an owner<br />- To avoid memory leaks, don&#8217;t construct a temporary shared_ptr object inside a function call&#8217;s argument list (if evaluating another argument throws, the new object can leak); create a named shared_ptr first<br />- Don&#8217;t construct a shared_ptr object from the <i>this</i> pointer (it causes double deletion)</p>
<p><b>4. intrusive_ptr </b><br />- Basically the same as shared_ptr<br />- Shared ownership of objects with an embedded reference count<br />- Can be constructed from an arbitrary raw pointer of type <b>T *</b><br />- Try <b>shared_ptr</b> first if it isn&#8217;t obvious whether <b>intrusive_ptr</b> better fits your needs</p>
<p><b>5. weak_ptr</b><br />- Just a reference: no ownership, no RAII, no shared ownership, no transfer of ownership<br />- Linked to a shared_ptr object and known by it<br />- When the last shared_ptr destroys the dynamic object it owns, every linked weak_ptr becomes expired<br />- It&#8217;s a safe way (no need to worry about dangling references) to refer to a dynamic object without owning it<br />- A nice feature of weak_ptr is that it can query the state of the corresponding shared_ptr object, for example through expired() or use_count()</p>
<p><b>6. unique_ptr</b><br />- C++0x introduced a new scoped_ptr-like pointer, unique_ptr, to replace auto_ptr<br />- It disables the copy assignment operator and the copy constructor<br />- Transfer of ownership can be done explicitly using std::move()</p>
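<p>Items 5 and 6 can be illustrated with a short sketch, assuming a C++11 standard library (the std:: versions of these pointers rather than the Boost ones). The helper names consume and read_or_default are made up for this example.</p>

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Takes ownership by value: the caller has to std::move() the pointer in,
// which makes the transfer of ownership explicit at the call site.
int consume(std::unique_ptr<int> owned) {
    return *owned;
}

// Observes an object without owning it. lock() yields an empty shared_ptr
// once the object is gone, so there is never a dangling access.
int read_or_default(const std::weak_ptr<int>& observer) {
    if (std::shared_ptr<int> locked = observer.lock()) {
        return *locked;
    }
    return -1;
}
```

<p>After <b>consume(std::move(p))</b> the source pointer p is empty, and after the last shared_ptr owner is reset, read_or_default() falls back to -1 instead of touching freed memory.</p>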
<p>These smart pointers are only suitable for a single dynamic object. For arrays of objects, use the smart pointers whose names end in &#8220;_array&#8221; (such as boost::scoped_array and boost::shared_array).</p>
<p><span style="font-size: large;"><b>Part II &#8211; Tips for shared_ptr </b></span></p>
<p><b>1. shared_ptr VS weak_ptr</b><br />- shared_ptr <b>owns</b> some heap object<br />- weak_ptr <b>points to</b> some heap object</p>
<p><b>2. Handling <i>this</i> Pointer</b></p>
<p>It&#8217;s safe to construct a shared_ptr object from a newly created heap object since it&#8217;s not managed by any other shared_ptr object yet. But when you want to pass <b><i>this</i></b> pointer to a function that expects a shared_ptr object, you will encounter a tricky problem because most likely, the heap object is already created and managed by other shared_ptr objects.</p>
<p>The problem is that, in general, you can&#8217;t create a shared_ptr from an existing <i><b>raw pointer</b></i> &#8211; the new shared_ptr you create won&#8217;t &#8220;know&#8221; about the other instances that refer to the same object and you&#8217;ll get multiple-deletes.</p>
<p>2.1. Use <a href="http://www.boost.org/doc/libs/1_36_0/libs/smart_ptr/enable_shared_from_this.html" rel="nofollow">enable_shared_from_this</a> from boost library</p>
<p>You can derive from <a href="http://www.boost.org/doc/libs/1_36_0/libs/smart_ptr/enable_shared_from_this.html" rel="nofollow">enable_shared_from_this</a> and then use &#8220;<i><b>shared_from_this()</b></i>&#8221; instead of &#8220;<i><b>this</b></i>&#8221; to get a shared pointer to the current object.</p>
<p>How is it implemented?<br />- A weak_ptr member is added that points to the existing shared_ptr object managing this object<br />- When a shared_ptr object is constructed from a raw pointer to this kind of object, it properly sets the weak_ptr inside that object<br />- <i><b>shared_from_this()</b></i> constructs a safe shared_ptr object from the weak_ptr member<br />- In the <a href="http://svn.boost.org/svn/boost/trunk/boost/smart_ptr/shared_ptr.hpp">boost shared_ptr implementation</a>, the &#8220;<i><b>sp_enable_shared_from_this()</b></i>&#8221; function gets called in shared_ptr&#8217;s constructor. In this function, if the passed-in dynamic object derives from <i><b>enable_shared_from_this</b></i>, the shared_ptr sets the object&#8217;s weak_ptr member to refer to itself.</p>
<p>If you adopt this method, be careful not to create such objects on the stack. An object on the stack is not managed by any shared_ptr, so no shared_ptr constructor is called and the corresponding weak_ptr member is never set.</p>
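<p>A minimal sketch of this approach, written against the C++11 std:: equivalents (std::enable_shared_from_this, std::shared_ptr) rather than the Boost classes linked above. The Node type and its self() member are hypothetical.</p>

```cpp
#include <cassert>
#include <memory>

struct Node : std::enable_shared_from_this<Node> {
    // Safe replacement for writing shared_ptr<Node>(this): the result shares
    // the reference count of the shared_ptr that already manages this object.
    std::shared_ptr<Node> self() {
        return shared_from_this();
    }
};
```

<p>If a Node is created with std::make_shared&lt;Node&gt;() and self() is then called on it, both pointers refer to the same object and use_count() reports 2, so the object is deleted exactly once. The object has to be owned by a shared_ptr before self() is called, as explained above.</p>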
<p>2.2 If you know that your object is long lived, you can do the following:</p>
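<p>struct null_deleter<br />{<br />&nbsp; &nbsp; template &lt;class T&gt; void operator()(T *) const {}<br />};</p>
<p>Then in your code, just return a shared_ptr&lt;your_type&gt;(this, null_deleter()). The resulting shared_ptr never deletes the object, so the object must outlive every copy of it.</p>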
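<p>Put together, the null_deleter approach might look like the sketch below. It assumes C++11 and std::shared_ptr instead of boost::shared_ptr; Service and as_shared() are hypothetical names.</p>

```cpp
#include <cassert>
#include <memory>

// Deleter that deliberately does nothing.
struct null_deleter {
    template <class T>
    void operator()(T*) const {}
};

struct Service {
    int id = 1;

    // Hands out a shared_ptr that never deletes `this`.
    // Only valid while the Service object itself is still alive.
    std::shared_ptr<Service> as_shared() {
        return std::shared_ptr<Service>(this, null_deleter());
    }
};
```

<p>Each call to as_shared() creates an independent control block, which is harmless here only because the deleter is a no-op. It also works for objects on the stack or with static storage, as long as they outlive the returned pointers.</p>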
<p><b>3. Handling Null Valued shared_ptr Object.</b></p>
<p>When you are using shared_ptr in your code, sometimes you need a NULL equivalent to represent a pointer that doesn&#8217;t point to anything meaningful.</p>
<p>Generally speaking, you have the following choices:</p>
<ul>
<li>Return iterators and the end iterator if not found</li>
<li>boost::optional&lt;return_type&gt;</li>
<li>Silly return codes</li>
</ul>
<p>Out of all the options, boost::optional and exceptions (when there really are exceptional circumstances) are the best methods. If you are dealing with containers, return an iterator to end and test for the end iterator.</p>
<p>Returning Zero/Null for smart pointers is acceptable in some cases too, when the other alternatives don&#8217;t make sense. Consider the following code:</p>
<p>class some_class_name{<br />public: &nbsp; &nbsp;<br />template&lt;typename T&gt; operator shared_ptr&lt;T&gt;() { return shared_ptr&lt;T&gt;(); } <br />} nullPtr;</p>
<p>Use this nullPtr object wherever a null boost::shared_ptr&lt;&gt; of any type is needed; its templated conversion operator produces an empty shared_ptr of the requested type.</p>
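<p>A usage sketch of the same trick, assuming C++11 and std::shared_ptr instead of boost::shared_ptr. The NullPtr type name and the find_name() lookup function are hypothetical.</p>

```cpp
#include <cassert>
#include <memory>
#include <string>

// Converts to an empty shared_ptr<T> for whatever T the context asks for.
struct NullPtr {
    template <typename T>
    operator std::shared_ptr<T>() const {
        return std::shared_ptr<T>();
    }
};
const NullPtr nullPtr = {};

// Hypothetical lookup: only id 1 is known.
std::shared_ptr<std::string> find_name(int id) {
    if (id == 1) return std::make_shared<std::string>("alice");
    return nullPtr;  // becomes an empty shared_ptr<std::string>
}
```

<p>With C++11 and later, returning the built-in nullptr achieves the same result, so a helper like this is mainly of interest for pre-C++11 code.</p>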
<p><span style="font-size: large;"><b>[Reference]</b></span></p>
<p>smart pointers overview<br /><a href="http://en.wikipedia.org/wiki/Smart_pointer">http://en.wikipedia.org/wiki/Smart_pointer</a><br /><a href="http://www.informit.com/articles/article.aspx?p=25264">http://www.informit.com/articles/article.aspx?p=25264</a> <br /><a href="http://www.drdobbs.com/184401507">http://www.drdobbs.com/184401507</a><br /><a href="http://dlugosz.com/Repertoire/refman/Classics/Smart%20Pointers%20Overview.html">http://dlugosz.com/Repertoire/refman/Classics/Smart%20Pointers%20Overview.html</a><br /><a href="http://ootips.org/yonat/4dev/smart-pointers.html">http://ootips.org/yonat/4dev/smart-pointers.html</a></p>
<p>unique_ptr<br /><a href="http://www.informit.com/guides/content.aspx?g=cplusplus&amp;seqNum=401">http://www.informit.com/guides/content.aspx?g=cplusplus&amp;seqNum=401</a>&nbsp;</p>
<p>shared_ptr<br /><a href="http://www.informit.com/guides/content.aspx?g=cplusplus&amp;seqNum=239">http://www.informit.com/guides/content.aspx?g=cplusplus&amp;seqNum=239</a></p>
<p>weak_ptr<br /><a href="http://www.informit.com/guides/content.aspx?g=cplusplus&amp;seqNum=300">http://www.informit.com/guides/content.aspx?g=cplusplus&amp;seqNum=300</a><br /><a href="http://www.drdobbs.com/184402026;jsessionid=X2T3WUC5FRMSDQE1GHPSKHWATMY32JVN">http://www.drdobbs.com/184402026;jsessionid=X2T3WUC5FRMSDQE1GHPSKHWATMY32JVN</a></p>
<p>shared_ptr for this pointer<br /><a href="http://stackoverflow.com/questions/142391/getting-a-boostshared-ptr-for-this">http://stackoverflow.com/questions/142391/getting-a-boostshared-ptr-for-this</a></p>
]]></content:encoded>
			<wfw:commentRss>http://csliu.com/2011/01/tips-for-smart-pointers-in-c/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Parallel Database for OLTP and OLAP</title>
		<link>http://csliu.com/2010/12/parallel-database-for-oltp-and-olap/</link>
		<comments>http://csliu.com/2010/12/parallel-database-for-oltp-and-olap/#comments</comments>
		<pubDate>Mon, 20 Dec 2010 12:33:00 +0000</pubDate>
		<dc:creator>csliu</dc:creator>
				<category><![CDATA[Database]]></category>
		<category><![CDATA[DistributedSystem]]></category>
		<category><![CDATA[WebArch]]></category>

		<guid isPermaLink="false">http://csliu.com/?p=109</guid>
		<description><![CDATA[<p>Just a survey article on materials on parallel database products and technologies for OLTP/OLAP applications. It mainly covers major commercial/academic efforts on developing parallel dbms to solve the ever growing large amount of relational data processing problem.<br />&#160; <br />Part I &#8211; Parallel DBMSs</p> <p>1.1 Parallel Database for OLAP (Shared-Nothing/MPP)</p> <p>TeraData<br />- <a href="http://www.teradata.com/">TeraData Home</a><br [...]]]></description>
			<content:encoded><![CDATA[<p>This is a survey of materials on parallel database products and technologies for OLTP/OLAP applications. It mainly covers the major commercial and academic efforts to develop parallel DBMSs for processing ever-growing volumes of relational data.</p>
<p><span style="font-size: large;"><b>Part I &#8211; Parallel DBMSs</b></span></p>
<p><span style="font-weight: bold;">1.1 Parallel Database for OLAP (Shared-Nothing/MPP)</span></p>
<p>TeraData<br />- <a href="http://www.teradata.com/">TeraData Home</a><br />- <a href="http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;arnumber=183180&amp;userType=inst">Teradata DBC/1012 Paper</a><br />- <a href="http://www.teradatawarehouse.com/t/white-papers/Exadata-the-Sequel-Exadata-V2-is-Still-Oracle/?type=WP">NCR Teradata VS Oracle Exadata</a></p>
<p>Vertica<br />- <a href="http://www.blogger.com/">Vertica Home</a><br />- The original research project: <a href="http://db.csail.mit.edu/projects/cstore/">C-Store</a></p>
<p>Paraccel<br />- <a href="http://paraccel.com/">Paraccel Home</a><br />- <a href="http://paraccel.com/technology/massively-parallel-processing-mpp/">MPP Based Architecture</a><br />- <a href="http://paraccel.com/technology/columnar-data-storage/">Columnar Based Storage</a> <br />- Flash Based Storage</p>
<p>DATAllegro (now MS Madison)<br />- <a href="http://www.monash.com/DATAllegro-V3.pdf">Design Choices in MPP Data Warehousing Lessons from DATAllegro V3</a><br />- <a href="http://www.microsoft.com/sqlserver/2008/en/us/parallel-data-warehouse.aspx">Microsoft SQL Server Parallel Data Warehousing</a> </p>
<p>Netezza<br />- <a href="http://www.netezza.com/">Netezza Home</a><br />- <a href="http://www-03.ibm.com/press/us/en/pressrelease/32955.wss">Acquired by IBM</a><br />- Hadoop &amp; Netezza: Synergy in Data Analytics (<a href="http://www.enzeecommunity.com/blogs/nzblog/2010/07/20/hadoop-netezza-synergy-in-data-analytics-results-in-new-customer-deployment-trends-part-1">Part 1</a>, <a href="http://www.enzeecommunity.com/blogs/nzblog/2010/07/22/hadoop-netezza-synergy-in-data-analytics-part-2">Part 2</a>)<br />- Netezza Twinfin VS Oracle Exadata (<a href="http://www.netezza.com/exadata-netezza-compared/">eBook</a>, <a href="http://www.enzeecommunity.com/blogs/nzblog/2010/08/04/four-fundamental-differences-between-twinfin-and-exadata">Blog</a>)</p>
<p>GreenPlum:<br />- <a href="http://www.greenplum.com/">GreenPlum Home</a> <br />- Combined: PostgreSQL/ZFS/MapReduce <br />- <a href="http://deals.venturebeat.com/2010/07/06/emc-greenplum-acquisition/">Acquired by EMC</a></p>
<p>Oracle ExaData:<br />- <a href="http://www.oracle.com/us/products/database/exadata/index.html">ExaData Home</a><br />- OLTP &amp; OLAP Hybrid Orientation<br />- 1 * RAC + N * Exadata Cells (Storage Node) + Infiniband Network<br />- Exadata Cell: Flash Cache + Disk Array + Data Filtering Logic (partial SQL execution)<br />- <a href="http://www.teradatawarehouse.com/t/white-papers/Exadata-the-Sequel-Exadata-V2-is-Still-Oracle/?type=WP">Exadata – the Sequel</a> is a great Exadata study article</p>
<p>IBM DB2 Data Partitioning Feature (can work with both OLAP/OLTP)<br />- formerly known as <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.29.5815">DB2 Parallel Edition</a> (<a href="http://delivery.acm.org/10.1145/230000/223876/p460-baru.pdf">A Shorter Overview</a>)<br />- <a href="http://www.ibmpressbooks.com/articles/article.asp?p=375537&amp;seqNum=6">DB2 At a Glance &#8211; Data Partitioning Feature</a><br />- <a href="http://www.ibm.com/developerworks/data/library/techarticle/0203lurie/0203lurie.html">Simulating Massively Parallel Database Processing on Linux</a></p>
<p>AsterData: <br />- <a href="http://www.asterdata.com/product/embedded-app.php">Supercharging Analytics with SQL-MapReduce</a><br />- <a href="http://www.securitypark.co.uk/security_article263885.html">Aster Data brings Applications inside an MPP Database</a></p>
<p>Misc Articles:<br />- <a href="http://whatis.techtarget.com/definition/0,,sid9_gci214085,00.html">What&#8217;s MPP?</a> <br />- <a href="http://space.itpub.net/673608/viewspace-620367">Comparison of Oracle to IBM DB2 UDB and NCR Teradata for Data Warehousing</a><br />- <a href="http://www.information-management.com/issues/20020501/5129-1.html">SMP or MPP for Data Warehouse</a><br />- <a href="http://www.dbms2.com/2008/09/05/mpp-data-warehouse-nodes/">Dividing the data Warehousing work among MPP Nodes</a><br />- <a href="http://www.dbms2.com/2008/09/06/sans-vs-das-in-mpp-data-warehousing/">SANs vs. DAS in MPP data Warehousing</a><br />- <a href="http://www.dbms2.com/2007/10/12/three-ways-oracle-and-microsoft-could-go-mpp/">Three ways Oracle or Microsoft could go MPP</a></p>
<p><span style="font-weight: bold;">1.2 Parallel Database for OLTP (Shared-Disk/SMP)</span></p>
<p>Oracle Real Application Cluster<br />- <a href="http://download.oracle.com/docs/cd/B10501_01/rac.920/a96597.pdf">Oracle RAC Concepts</a><br />- <a href="http://download.oracle.com/docs/cd/A87860_01/doc/paraserv.817/a76968.pdf">Oracle Parallel Database Server Concepts</a><br />- <a href="http://download.oracle.com/owsf_2003/OOW2003_PPT_36700.pdf">Oracle RAC Case Study on 16-Node Linux Cluster</a></p>
<p>IBM DB2 for z/OS (with <a href="http://en.wikipedia.org/wiki/IBM_Parallel_Sysplex">Sysplex Technology</a>)<br />- <a href="http://www.sswug.org/articles/viewarticle.aspx?id=29395">Share Disk and Share Nothing for IBM DB2</a><br />- <a href="http://catterallconsulting.blogspot.com/2008/06/what-is-db2-data-sharing.html">What&#8217;s DB2 Data Sharing?</a></p>
<p>IBM DB2 for LUW (with <a href="http://www.openfabrics.org/archives/spring2010sonoma/Monday/10.30%20Steve%20Rees%20DB2/DB2%20pureScale%20OpenFabrics%20Rees.pdf">pureScale Technology</a>)<br />- <a href="http://www.databasejournal.com/features/db2/article.php/3894636/IBM-DB2-pureScale-The-Next-Big-Thing-or-a-Solution-Looking-for-a-Problem.htm">IBM DB2 pureScale: The Next Big Thing or a Solution Looking for a Problem?</a><br />- <a href="http://www.ibm.com/developerworks/data/library/dmmag/DBMag_2010_Issue1/DBMag_Issue109_pureScale/">What is DB2 pureScale?</a><br />- DB2 pureScale Scalability (<a href="http://it.toolbox.com/blogs/db2luw/db2-purescale-scalability-part-1-35173">section 1</a>, <a href="http://it.toolbox.com/blogs/db2luw/db2-purescale-scalability-part-2-35413">section 2</a>)</p>
<p><span style="font-weight: bold;"><span style="font-size: large;">Part II &#8211; Academic Readings</span></span></p>
<p><span style="font-weight: bold;">2.1 Overview</span><br />1). <a href="http://pages.cs.wisc.edu/%7Edewitt/includes/paralleldb/cacm.pdf">Parallel Database System: The Future of High Performance Database Processing</a><br />2). <a href="http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf">Survey of Architecture of Parallel Database System</a><br />3). <a href="http://www.google.com/url?sa=t&amp;source=web&amp;cd=1&amp;ved=0CBMQFjAA&amp;url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.58.5370%26rep%3Drep1%26type%3Dpdf&amp;rct=j&amp;q=The%20Case%20for%20Shared%20Nothing&amp;ei=r_QQTe2wIM7KrAfgwqzDCw&amp;usg=AFQjCNFA7BOgsoEjRi88bL1LJgtWUGiuNg&amp;sig2=jevJuVTZ1LrzMF6wQ9zkUg&amp;cad=rja">The Case for Shared Nothing</a><br />4). <a href="http://portal.acm.org/citation.cfm?id=234892">Much Ado About Shared-Nothing</a></p>
<p><b>2.2 Research System</b><br />1). <a href="http://www.google.com/#sclient=psy&amp;hl=en&amp;q=XPS:+A+High+Performance+Parallel+Database+Server&amp;aq=f&amp;aqi=&amp;aql=&amp;oq=&amp;gs_rfai=&amp;pbx=1&amp;fp=9bef8cda26d1a6ec">XPS: A High Performance Parallel Database Server</a><br />2). <a href="http://www.google.com/url?sa=t&amp;source=web&amp;cd=1&amp;ved=0CBcQFjAA&amp;url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.99.7911%26rep%3Drep1%26type%3Dpdf&amp;rct=j&amp;q=The%20Design%20of%20XPRS&amp;ei=ZfUQTaD5JsWHrAet-MjZCw&amp;usg=AFQjCNFqzEmyoyd76ScdYidaCs1xp1PjRQ&amp;sig2=v9HCIFjxLaR6xjkDnzaz_A&amp;cad=rja">The Design of XPRS</a><br />3). <a href="http://portal.acm.org/citation.cfm?id=627396">Prototyping Bubba, A Highly Parallel Database System</a><br />4). <a href="http://www.google.com/url?sa=t&amp;source=web&amp;cd=1&amp;ved=0CBMQFjAA&amp;url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.113.6798%26rep%3Drep1%26type%3Dpdf&amp;rct=j&amp;q=The%20Gamma%20Database%20Machine%20Project&amp;ei=lvUQTZnGEsrtrQfhwfiCDA&amp;usg=AFQjCNGHkaEH__8TJGnHc9VTkE0R3KL4WQ&amp;sig2=GZDk7r5tPxOBqe89DdM1vw&amp;cad=rja">The Gamma Database Machine Project</a><br />5). <a href="http://www.hpl.hp.com/techreports/tandem/TR-87.4.pdf">NonStop SQL, A Distributed, High-Performance, High-Availability Implementation of SQL</a><br />6). <a href="http://portal.acm.org/citation.cfm?id=166649">Parallel Query Processing in Shared Disk Database System</a><br />7). <a href="http://ieeexplore.ieee.org/iel2/378/4716/00183179.pdf">Architecture of SDC, the Super Database Computer</a></p>
<p><b>2.3 Commercial System</b><br />1). <a href="http://www.springerlink.com/index/yr2503131v2r1410.pdf">A Study of A Parallel Database Machine and Its Performance</a> &#8211; The NCR/TERADATA DBC/1012<br />2). <a href="http://ieeexplore.ieee.org/xpls/abs_all.jsp?isNumber=4716&amp;arNumber=183180&amp;isnumber=4716&amp;arnumber=183180&amp;tag=1">A Practical Implementation of the Database Machine</a> &#8211; Teradata DBC/1012<br />3). <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.29.5815">DB2 Parallel Edition</a><br />4). <a href="http://portal.acm.org/citation.cfm?id=1007568.1007666">Parallel SQL Execution in Oracle 10g</a><br />5). <a href="http://doi.ieeecomputersociety.org/10.1109/ICDE.2003.1260883">Shared Cache &#8211; The Future of Parallel Database</a><br />6). <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.80.8779">Cache Fusion: Extending Shared-Disk Clusters with Shared Caches</a></p>
]]></content:encoded>
			<wfw:commentRss>http://csliu.com/2010/12/parallel-database-for-oltp-and-olap/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Lecture Notes – AltaVista Indexing and Search Engine</title>
		<link>http://csliu.com/2010/12/lecture-notes-altavista-indexing-and-search-engine/</link>
		<comments>http://csliu.com/2010/12/lecture-notes-altavista-indexing-and-search-engine/#comments</comments>
		<pubDate>Wed, 15 Dec 2010 17:37:00 +0000</pubDate>
		<dc:creator>csliu</dc:creator>
				<category><![CDATA[InfoRetrieval]]></category>

		<guid isPermaLink="false">http://csliu.com/?p=108</guid>
		<description><![CDATA[<p>01/18/2000, <a href="http://research.google.com/pubs/author24014.html">Michael Burrows</a> gave a technical presentation  at UW. In this video, he talked about the design of the AltaVista indexing system and the search engine site. The presentation is short and brief, but covers many core design and concepts which are used in today&#8217;s commercial search engine systems.</p> <p>The presentation video can be [...]]]></description>
			<content:encoded><![CDATA[<p><span id="mediaGroupProductionDate" class="bodytxtblack11pt">01/18/2000, </span><a href="http://research.google.com/pubs/author24014.html">Michael Burrows</a> gave a technical presentation  at UW. In this video, he talked about the design of the AltaVista indexing system and the search engine site. The presentation is short and brief, but covers many core design and concepts which are used in today&#8217;s commercial search engine systems.</p>
<p>The presentation video can be found at uwtv: <a href="http://uwtv.org/programs/displayevent.aspx?rid=2123%20">http://uwtv.org/programs/displayevent.aspx?rid=2123 </a></p>
<p>I have recreated the slides used in his video for future reference. I tried my best to record the text and redraw the diagrams, but errors may have crept in during the process. The copyright belongs to Mike.</p>
<div style="width:425px" id="__ss_8539397"> <strong style="display:block;margin:12px 0 4px"><a href="http://www.slideshare.net/changshuliu/altavista-search-engine-architecture" title="AltaVista Search Engine Architecture" target="_blank">AltaVista Search Engine Architecture</a></strong> <iframe src="http://www.slideshare.net/slideshow/embed_code/8539397" width="425" height="355" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
<div style="padding:5px 0 12px"> View more <a href="http://www.slideshare.net/" target="_blank">presentations</a> from <a href="http://www.slideshare.net/changshuliu" target="_blank">changshuliu</a> </div>
</p></div>
<p>I think the most interesting designs are the<strong> Location Space</strong> and the <strong>ISR abstraction</strong>. The first makes it possible to store any kind of information using the inverted index mechanism, and the second solves the problem of interpreting complicated search query semantics.</p>
<p>But it&#8217;s not easy to fully understand how the whole ISR system works to serve various query semantics.</p>
<p>In the second part of his presentation, Mike covers many aspects of the AltaVista search engine web site. Many of these experiences and designs are still a good reference for today&#8217;s Internet web applications.<br />
<strong><br />
</strong><br />
<strong>[Reference]</strong><br />
1. http://www.searchenginehistory.com/<br />
2. http://en.wikipedia.org/wiki/Search_engine<br />
3. http://en.wikipedia.org/wiki/AltaVista</p>
]]></content:encoded>
			<wfw:commentRss>http://csliu.com/2010/12/lecture-notes-altavista-indexing-and-search-engine/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Source Insight Tips</title>
		<link>http://csliu.com/2010/11/source-insight-tips/</link>
		<comments>http://csliu.com/2010/11/source-insight-tips/#comments</comments>
		<pubDate>Wed, 24 Nov 2010 05:40:00 +0000</pubDate>
		<dc:creator>csliu</dc:creator>
				<category><![CDATA[Engineering]]></category>

		<guid isPermaLink="false">http://csliu.com/?p=105</guid>
		<description><![CDATA[1. Specify file types to add in a project<br />Option -> Document Options -> Document Type -> Include when adding to projects <p> 2. Add new language support Option -> Preferences -> Languages -> Import/Add</p> <p>3. Associate new file type to some language<br />Options -> Document Options -> Document Type -> Add Type <p> 4. [...]]]></description>
			<content:encoded><![CDATA[<div>1. Specify file types to add in a project<br />Option -> Document  Options -> Document Type -> Include when adding to projects</div>
<p> 
<div>2. Add new language support</div>
<div>Options -> Preferences -> Languages -> Import/Add</p>
<p>3. Associate new file type to some language<br />Options -> Document Options -> Document Type -> Add Type</div>
<p> 
<div>4. Add new files to project automatically<br />Project -> Synchronize Files -> Add new files automatically</p>
<p>5. Show full path of source code file <br />Options -> Preferences -> Display -> uncheck Trim Long Path Names With Ellipses</p>
<p>6.  Project dependency<br />Options -> Preferences -> Symbol Lookups -> Add Project to Path</p>
<p>7. Create common projects<br />Options -> Preferences -> Symbol Lookups -> Create Common Projects</p>
<p>8. Colors<br />Background Color: Options -> Preferences -> Windows background -> Color<br />Foreground Color: Options -> Preferences -> Default Text -> Color</p>
<p>9. Fonts<br />Options -> Document Options -> Document Type -> Screen/Print Font</p>
<p>NOTE: Options -> Style Properties gives finer control over each element&#8217;s font and color. In this dialog box you can also save all your settings to a file on disk and share it with others.</p>
<p>10. Fixed width view<br />View -> Draft View (this ignores all style settings)</p>
<p>11. Shortcut Keys<br />You can set them using: Options -> Key Assignment</p>
<p>The common default settings are:<br />    Ctrl + = : Jump to definition<br />    Alt + / : Look up reference<br />    F3 : Search backward<br />    F4 : Search forward<br />    F5 : Go to line<br />    F7 : Look up symbols<br />    F8 : Look up local symbols<br />    F9 : Indent left<br />    F10 : Indent right<br />    F12 : Incremental search<br />    Alt+, : Jump backward<br />    Alt+. : Jump forward<br />    Shift+F3 : Search backward for the word under the cursor<br />    Shift+F4 : Search forward for the word under the cursor<br />    Shift+F8 : Highlight word<br />    Shift+Ctrl+F : Search in project</p>
<p>12. Custom Command<br />Options -> Custom command</p>
<p>There are many substitution chars you can use when invoking the command, for example:<br />%f &#8211; full path of current file<br />%l &#8211; line number of current file<br />%d &#8211; full dir path of current file</p>
<p>Full list can be found in SI&#8217;s help doc: Command Reference -> Custom Commands -> Command Line Substitutions</p>
<p>13. Macros</p>
<p>Source Insight provides a C-like macro language, which is useful  for scripting commands, inserting specially formatted text, and automating  editing operations. Macros are saved in a text file with a .EM extension. Once a macro file  is part of the project, the macro functions in the file become  available as user-level commands in the Key Assignments or Menu  Assignments  dialog boxes.</p>
<p>For language reference, see &#8220;Macro Language Guide&#8221; section in SI help doc.</p>
<p>SI&#8217;s web site also contains some sample macro files: <a href="http://www.sourceinsight.com/public/macros">http://www.sourceinsight.com/public/macros</a></p>
<p>14. Special Features</p>
<p>Conditional Parsing:<br />- This is similar to conditional compilation in C/C++: you choose which statements to parse<br />- You can change the settings using: Project -> Project Settings -> Conditions</p>
<p>Token Macro:<br />- Similar to the C/C++ macro feature, but can be used in other languages<br />- Defined in a *.tom file<br />- Put it in your project data directory</p>
<p><b>[Reference]</b></p>
<p>1. <a href="http://blog.csdn.net/better0332/archive/2010/06/23/5689193.aspx">Reading WRK code using Source Insight</a></p>
<p>2. <a href="http://blog.chinaunix.net/u/30708/showart_425405.html">How to use SI to read linux kernel code</a></p>
<p>3. <a href="http://blog.chinaunix.net/u/30708/showart_425405.html">A good macro file for source insight</a></div>
<div><a href="http://blog.chinaunix.net/u/30708/showart_425405.html"><br /></a>  </div>
]]></content:encoded>
			<wfw:commentRss>http://csliu.com/2010/11/source-insight-tips/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Integer Variable Length in C/C++ on Different Platforms</title>
		<link>http://csliu.com/2010/10/integer-variable-length-in-cc-on-different-platforms/</link>
		<comments>http://csliu.com/2010/10/integer-variable-length-in-cc-on-different-platforms/#comments</comments>
		<pubDate>Tue, 26 Oct 2010 05:42:00 +0000</pubDate>
		<dc:creator>csliu</dc:creator>
				<category><![CDATA[Engineering]]></category>

		<guid isPermaLink="false">http://csliu.com/?p=101</guid>
		<description><![CDATA[<p>While writing codes for multiple platform (in terms of both OS and CPU Arch), making the code independent of the exact byte size of each integer type in c/c++ on each specific platform becomes a typical challenging problem.</p> <p>What about the ANSI C standard regarding this problem?</p> <p>The standard defines 5 standard integer types:<br />- [...]]]></description>
			<content:encoded><![CDATA[<p>While writing codes for multiple platform (in terms of both OS and CPU Arch), making the code independent of the exact byte size of each integer type in c/c++ on each specific platform becomes a typical challenging problem.</p>
<p>What about the ANSI C standard regarding this problem?</p>
<p>The standard defines 5 standard integer types:<br /><span style="font-weight: bold;">- unsigned char</span> <span style="font-weight: bold;">- short int</span> <span style="font-weight: bold;">- int</span> <span style="font-weight: bold;">- long int</span> <span style="font-weight: bold;">- long long in</span>t</p>
<p>It also defines some limitations on these types in <span style="font-weight: bold;">limits.h</span></p>
<p>But it didn&#8217;t say explicitly on the exact byte size of each type.</p>
<p>The common understanding of the standard is that it requires:
<pre class="prettyprint"><code><span style="font-weight: bold;" class="kwd">sizeof</span><span style="font-weight: bold;" class="pun">(</span><span style="font-weight: bold;" class="kwd">short int</span><span style="font-weight: bold;" class="pun">)</span><span style="font-weight: bold;" class="pln"> </span><span style="font-weight: bold;" class="pun">&lt;=</span><span style="font-weight: bold;" class="pln"> </span><span style="font-weight: bold;" class="kwd">sizeof</span><span style="font-weight: bold;" class="pun">(</span><span style="font-weight: bold;" class="kwd">int</span><span style="font-weight: bold;" class="pun">)</span><span style="font-weight: bold;" class="pln"> </span><span style="font-weight: bold;" class="pun">&lt;=</span><span style="font-weight: bold;" class="pln"> </span><span style="font-weight: bold;" class="kwd">sizeof</span><span style="font-weight: bold;" class="pun">(</span><span style="font-weight: bold;" class="kwd">long int</span><span style="font-weight: bold;" class="pun">)</span><span style="font-weight: bold;" class="pln"> </span><span style="font-weight: bold;" class="pun">&lt;=</span><span style="font-weight: bold;" class="pln"> </span><span style="font-weight: bold;" class="kwd">sizeof</span><span style="font-weight: bold;" class="pun">(</span><span style="font-weight: bold;" class="kwd">long</span><span style="font-weight: bold;" class="pln"> </span><span style="font-weight: bold;" class="kwd">long int</span><span style="font-weight: bold;" class="pun">)</span></code></pre>
<p>So what does a popular compiler&#8217;s documentation say about this?</p>
<p>For Visual Studio 10, there is an <a href="http://msdn.microsoft.com/en-us/library/s3f49ktz%28v=VS.100%29.aspx">article on MSDN that describes the exact size of each integer type</a>.</p>
<p>From that article we can see:<br />sizeof(<span style="font-weight: bold;">short int</span>) = 2<br />sizeof(<span style="font-weight: bold;">int</span>) = 4<br />sizeof(<span style="font-weight: bold;">long int</span>) = 4<br />sizeof(<span style="font-weight: bold;">long long int</span>) = 8</p>
<p><span style="font-weight: bold;">These sizes hold on both 32-bit and 64-bit platforms.</span></p>
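<p>The sizes listed above are specific to Visual Studio, so it can be useful to check what another compiler reports. Here is a minimal sketch, assuming a C99 compiler; the names <code>BITS_OF</code>, <code>integer_sizes_are_ordered</code> and <code>print_integer_sizes</code> are made up for this illustration and do not come from any library.</p>

```c
#include <limits.h>
#include <stdio.h>

/* Width of a type in bits; CHAR_BIT (from limits.h) is the number of bits per byte.
   BITS_OF is a hypothetical helper name used only in this sketch. */
#define BITS_OF(type) ((unsigned)(sizeof(type) * CHAR_BIT))

/* Returns 1 if the commonly assumed size ordering holds on this platform, 0 otherwise. */
int integer_sizes_are_ordered(void)
{
    return sizeof(short int) <= sizeof(int)
        && sizeof(int) <= sizeof(long int)
        && sizeof(long int) <= sizeof(long long int);
}

/* Prints the size in bytes of each standard integer type
   for the current compiler and target. */
void print_integer_sizes(void)
{
    printf("short int     : %u\n", (unsigned)sizeof(short int));
    printf("int           : %u\n", (unsigned)sizeof(int));
    printf("long int      : %u\n", (unsigned)sizeof(long int));
    printf("long long int : %u\n", (unsigned)sizeof(long long int));
}
```

<p>Calling <code>print_integer_sizes()</code> from <code>main</code> on a 64-bit Linux build with GCC typically reports 8 for <code>long int</code>, unlike the 4 listed above for Visual Studio. This difference is exactly why portable code should not assume a particular size.</p>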
<p>To help programmers be aware of the exact size of the integer types they are using, VS 10 provides some other integer types:<br /><span style="font-weight: bold;">__int8, __int16, __int32, __int64</span> and their unsigned counterparts.</p>
<p>In fact, ANSI C99 also defines fixed-width integer types in <a href="http://en.wikipedia.org/wiki/Stdint.h">stdint.h</a>:<br /><span style="font-weight: bold;">uint8_t/int8_t</span><br /><span style="font-weight: bold;">uint16_t/int16_t</span><br /><span style="font-weight: bold;">uint32_t/int32_t</span><br /><span style="font-weight: bold;">uint64_t/int64_t</span></p>
<p>What about scanf() and printf()? The format string macros for these types are defined in the standard header inttypes.h. For example, this is an <a href="http://msinttypes.googlecode.com/svn/trunk/inttypes.h">inttypes.h for Visual Studio</a>.</p>
<p>And here is a<a href="http://www.remlab.net/op/integer.shtml"> good summary on how to use format strings</a> to deal with integer types in C/C++.</p>
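<p>To make the inttypes.h part concrete, here is a small sketch of how the format macros combine with the fixed-width types. It assumes a C99 standard library; the function names <code>format_i64</code>, <code>format_u32_hex</code> and <code>parse_i64</code> are hypothetical helpers written for this example only.</p>

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Writes a signed 64-bit value in decimal into buf.
   PRId64 expands to the right printf conversion for int64_t on this platform. */
int format_i64(char *buf, size_t len, int64_t value)
{
    return snprintf(buf, len, "%" PRId64, value);
}

/* Writes an unsigned 32-bit value as 8 upper-case hex digits with a 0x prefix. */
int format_u32_hex(char *buf, size_t len, uint32_t value)
{
    return snprintf(buf, len, "0x%08" PRIX32, value);
}

/* Parses a decimal int64_t from text; returns 1 on success, 0 otherwise.
   SCNd64 is the scanf-side counterpart of PRId64. */
int parse_i64(const char *text, int64_t *out)
{
    return sscanf(text, "%" SCNd64, out) == 1;
}
```

<p>The macros expand to string literals, so they rely on adjacent string literal concatenation: <code>"%" PRId64</code> becomes one format string. The benefit is that the same source line stays correct whether int64_t is a long or a long long on the target.</p>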
<p>[Reference]<br />1. <a href="http://en.wikipedia.org/wiki/Stdint.h">stdint.h in C99</a><br />2. <a href="http://msdn.microsoft.com/en-us/library/s3f49ktz%28v=VS.100%29.aspx">Integer Types in VS10 </a><br />3. <a href="http://www.open-std.org/JTC1/SC22/WG14/www/docs/n1256.pdf">ANSI C99 Spec</a><br />4. <a href="http://www.drdobbs.com/184401323">Integers in C99</a></p>
]]></content:encoded>
			<wfw:commentRss>http://csliu.com/2010/10/integer-variable-length-in-cc-on-different-platforms/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Alexa and Its Ranking List</title>
		<link>http://csliu.com/2010/10/alexa-and-its-ranking-list/</link>
		<comments>http://csliu.com/2010/10/alexa-and-its-ranking-list/#comments</comments>
		<pubDate>Thu, 14 Oct 2010 07:21:00 +0000</pubDate>
		<dc:creator>csliu</dc:creator>
				<category><![CDATA[CsNotes]]></category>

		<guid isPermaLink="false">http://csliu.com/?p=100</guid>
		<description><![CDATA[<p>I recently read some material talking about the page view ranking of some web sites. It said that the source of the ranking data is <a href="http://www.alexa.com">alexa.com</a>.</p> <p>It&#8217;s good to know source of referenced data but what&#8217;s the confidence of these data source?</p> <p>I had a look that website to learn how the ranking list [...]]]></description>
			<content:encoded><![CDATA[<p>I recently read some material talking about the page view ranking of some web sites. It said that the source of the ranking data is <a href="http://www.alexa.com">alexa.com</a>.</p>
<p>It&#8217;s good to know the source of referenced data, but how trustworthy is this data source?</p>
<p>I had a look at that website to learn how the ranking list is generated. Here is my understanding:</p>
<p>1. Alexa’s traffic rankings are based on data collected from <span style="font-weight: bold;">Alexa  Toolbar</span> and other, diverse sources over a  <span style="font-weight: bold;">rolling 3 month</span> period.</p>
<p>2. A site’s ranking is based on a combined measure  of <span style="font-weight: bold;">Reach </span>and<span style="font-weight: bold;"> PageViews</span>.<br />- Reach is determined by the number of unique  Alexa users who visit a site on a given day.<br />- PageViews are the total  number of Alexa user URL requests for a site. However, multiple requests  for the same URL on the same day by the same user are counted as a  single pageview.</p>
<p>3. Sites with relatively low traffic will not be accurately ranked by Alexa.  Traffic rankings of 100,000+ should be regarded as not reliable.  Conversely, the closer a site gets to #1, the more reliable its rank. This is because Alexa only uses data sampled from Alexa Toolbar users, who are in fact just a small portion of all Internet users.</p>
<p>So it seems that the ranking list should not be considered very authoritative, as very few people use the toolbar. But why has it become so popular and important for many VCs? I guess it&#8217;s mainly due to the lack of better alternatives.</p>
<p>Better data providers would be web browser vendors like Microsoft, Mozilla and Google. But obviously, they are not willing to share the data they collect with the community, because of privacy concerns and potential legal issues.</p>
<p>[<span style="font-weight: bold;">Reference</span>]<br />1. How Reliable Are Your Traffic Ranking?<br /><a href="http://www.alexa.com/faqs/?p=139">http://www.alexa.com/faqs/?p=139</a></p>
<p>2. How are Alexa’s traffic rankings determined?<br /><a href="http://www.alexa.com/faqs/?p=134">http://www.alexa.com/faqs/?p=134</a></p>
<p>3. About the Alexa Traffic Rankings<br /><a href="http://www.alexa.com/help/traffic_learn_more">http://www.alexa.com/help/traffic_learn_more</a></p>
]]></content:encoded>
			<wfw:commentRss>http://csliu.com/2010/10/alexa-and-its-ranking-list/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Disciplines in Microsoft Engineering Team</title>
		<link>http://csliu.com/2010/09/disciplines-in-microsoft-engineering-team/</link>
		<comments>http://csliu.com/2010/09/disciplines-in-microsoft-engineering-team/#comments</comments>
		<pubDate>Thu, 23 Sep 2010 15:37:00 +0000</pubDate>
		<dc:creator>csliu</dc:creator>
				<category><![CDATA[Engineering]]></category>

		<guid isPermaLink="false">http://csliu.com/?p=99</guid>
		<description><![CDATA[I really want this blog to be a place to express my own ideas and thoughts, but I don&#8217;t refuse reference other people&#8217;s great ideas, especially when they are really helpful for me or potential readers.</p> <p>The following content is copied from a MSDN blog post named- <a href="http://blogs.msdn.com/b/prakas/archive/2008/01/27/product-development-disciplines-at-microsoft.aspx">Product Development Disciplines at Microsoft,</a> I just [...]]]></description>
			<content:encoded><![CDATA[<div>I really want this blog to be a place to express <span style="font-weight: bold;">my</span> <span style="font-weight: bold;">own</span> ideas  and thoughts, but I don&#8217;t refuse reference other people&#8217;s great ideas, especially when they are really helpful for me or potential readers.</p>
<p>The following content is copied from a MSDN blog post named- <a href="http://blogs.msdn.com/b/prakas/archive/2008/01/27/product-development-disciplines-at-microsoft.aspx">Product Development Disciplines at Microsoft,</a> I just highlighted some lines.</p>
<p style="margin: 0in 0in 10pt;"><span style="font-size:100%;">&#8220;Over  the last several months in my role here in China, I have given talks at  several leading universities and met with many of the leading faculty  and students working on technologies related to the Data Platform. I’ve  also spoken at several industry conferences, meeting with customers,  partners, analysts and other industry folks. There are many topics that  come up at these meetings – changing technology trends, distributed  development, the tremendous growth of Asia etc. But one topic that seems  to come up more than almost any other is the question of how we  organize and conduct our product development in Microsoft. I suppose  this is only natural – Microsoft is one of the most successful software  companies in the world, and the software industry here in this region is  poised for tremendous growth, so it makes sense that people in the  industry are eager to learn from the our experience over the last  quarter century.</span></p>
<p style="margin: 0in 0in 10pt;"><span style="font-size:100%;">This  is actually a very big topic and <span style="font-weight: bold;">within Microsoft we have an  Engineering Excellence group that actually runs courses that can span  several days and provide an overview of Microsoft’s software development  methodology, our engineering system, organizational structures, best  practices, tools and technologies we use internally ensure quality,  reliability, security etc and a variety of related topic.</span> By no means  would we claim that we have all this figured out perfectly and have a  perfect system, but there is indeed a lot of accumulated knowledge and  experience that we can share. And we do actually share this information,  in appropriate form, with others in our industry, worldwide and also in  this region. </span></p>
<p style="margin: 0in 0in 10pt;"><span style="font-size:100%;">As  this is indeed a large topic, I don’t want to get too deep into this  here, but I do want to address one aspect of our engineering system –  the core disciplines that we organize our R&amp;D teams around and the  particular roles that each of these disciplines plays. I want to discuss  this because I believe Microsoft does this a little bit differently  from the rest of the industry even in the US, and especially here in  China there is not a good understanding of these core disciplines and  what role each of them plays.</span></p>
<p style="margin: 0in 0in 10pt;"><span style="font-size:100%;">Traditionally,   <span style="font-weight: bold;"> the Microsoft engineering system has consisted of 3 “core” disciplines:  “Development”, “Test”, and “Program Management”, also known as  Dev/Test/PM for short.</span> I’m going to touch on each of these briefly here,  but I like to introduce them in a different order:</span></p>
<p style="margin: 0in 0in 10pt;"><span style="font-size:100%;"><span><b><u>PM</u>:</b> When we think of engineering disciplines, most people start with “Dev”. <span> </span>For  me however, things really start with the Program Management discipline.  At Microsoft, “PM” means many different things, but for me the core  essence of the PM role is two things:</span></span></p>
<p style="margin: 0in 0in 0pt 0.5in;"><span><span><span style="font-size:100%;">1.</span><span style="font: 7pt 'Times New Roman';">       </span></span></span><span style="font-size:100%;">The first part of the PM’s job is to <span style="font-weight: bold;">understand the customer’s requirements and translate that into a </span><i style="font-weight: bold;">functional specification</i><span style="font-weight: bold;">  of what we should build.</span> This is where it all begins. If we don’t  understand the customer, it is not very likely that we’ll end up  building the right thing.</span></p>
<p style="margin: 0in 0in 10pt 0.5in;"><span><span><span style="font-size:100%;">2.</span><span style="font: 7pt 'Times New Roman';">       </span></span></span><span style="font-size:100%;">The  second part of the PM’s job is to <span style="font-weight: bold;">work with Dev and Test to translate  the initial specification into a living, breathing product.</span></span></p>
<p style="margin: 0in 0in 10pt;"><span style="font-size:100%;">I  find that many people, especially here in China, think “Project  Management” when they hear PM. Indeed, Project Management is part of a  PM’s job (under #2 above), but it is only a part of the PM’s job. The  real skill that a PM brings is the expertise to listen to customers,  understand the world from their point of view, and then to design a  solution for their problem. This does not just mean giving customers  what they ask for literally, but to truly understand them and design a  solution that solves their problems even if the customers could never  imagine the solution – as the famous saying goes, if we had only  listened to customers, we would have looked for a faster horse, not come  up with the automobile.</span></p>
<p style="margin: 0in 0in 10pt;"><span style="font-size:100%;"><span><b><u>Dev</u>: </b><span> </span>Of  all the engineering disciplines, this one is probably the one people  think about the most commonly. Dev is short-hand for “Development”, the  folks who responsibility it is to actually design and build the software  that we ship. The essential job of Dev is to <span style="font-weight: bold;">take the </span><i style="font-weight: bold;">functional specification </i><span style="font-weight: bold;">produced  by PM and translate that into an actual implementation</span>. In the world of  mission-critical system-level software, this implementation better be  extremely reliable, secure, manageable, scalable and high-performance.<span>  </span>And the designs and implementations Dev produces better stand the test of time and last for several versions and years to come.</span></span></p>
<p style="margin: 0in 0in 10pt;"><span style="font-size:100%;"><span><b><u>Test</u></b>:<b> </b>The<span> t</span>est <span>discipline  in Microsoft is much misunderstood, certainly externally, but sometimes  internally as well. When I first came to Microsoft many years ago, I  was (pleasantly) surprised to find that <span style="font-weight: bold;">Microsoft had almost as many, if  not more, testers as developers.</span> Coming from a company that had a much  less developed testing discipline (and where as a result, quality  assurance was considerably weak), it took a little while to get used to  what the essence of the Test discipline really is. The reality is that, <span style="font-weight: bold;"> in Microsoft, how fast we can ship software depends on not how quickly  we can design and implement it but rather on how quickly we can test it.</span>  This is because every piece of software we ship, especially on the  systems-software side, has to pass an extremely high quality bar. The  Test discipline is really an complex area, and one where have learned a  lot over the years in terms of different types of testing that we employ  – <span style="font-weight: bold;">unit tests, functional test, integration tests, stress and long-haul  tests, performance tests, security tests, localization tests, etc.</span>  The set of tools and techniques we employ in test is truly some of the  most impressive and complex – automated test harnesses, automated test  generators, automated test failure analyzers, automated security  “fuzzers”,<span>  </span>fail-point and state-machine based testing. </span></span></span></p>
<p style="margin: 0in 0in 10pt;"><span><span style="font-size:100%;"><span>The  three “core” engineering disciplines described above are like the 3  legs of a chair – you need all three of them, and in a balance, to have a  proper engineering organization. No one leg can dominate the other –  otherwise, you get an organization that may not be in touch with  customers needs or one that does not pay enough attention to quality.  Indeed, the three disciplines are a little bit like the branches of  government – they form a system of checks and balances that ensures we  understand what customers want, we design and build that with high  quality, and we ensure that we deliver a product that meets customer  expectations in every regard.</span></span></span></p>
<p style="margin: 0in 0in 10pt;"><span><span style="font-size:100%;"><span>It  is also important to emphasize that we aim to attract the best talent  to all three core disciplines – the bar is equally high for all the  disciplines, it just happens to be that the passion and skill-set for  each is a little different:<span style="font-weight: bold;"><br /> </span></span></span></span></p>
<p style="margin: 0in 0in 10pt;"><span><span style="font-size:100%;"><span><span style="font-weight: bold;">- PMs usually have a passion for working with  customers, conceptualizing what the product should do, and then working  with their Dev and Test peers to coordinate all the work to make sure we  deliver exactly that.</span> <span style="font-weight: bold;"><br />         </span></span></span></span></p>
<p style="margin: 0in 0in 10pt;"><span><span style="font-size:100%;"><span><span style="font-weight: bold;">- Developers have a passion for building  top-quality software – software that is innovative, simple, reliable,  secure, scalable, high-performance and stands the test of time.<br />         </span></span></span></span></p>
<p style="margin: 0in 0in 10pt;"><span><span style="font-size:100%;"><span><span style="font-weight: bold;">- </span></span></span></span><span><span style="font-size:100%;"><span><span style="font-weight: bold;">Testers are passionate about finding all kinds of ways to break software  and making sure making sure we find all the issues and bugs </span><i style="font-weight: bold;">before </i><span style="font-weight: bold;">we ship it to customers.</span></span></span></span><span><span style="font-size:100%;"><span> <span style="font-weight: bold;"></span></span></span></span></p>
<p style="margin: 0in 0in 10pt;"><span><span style="font-size:100%;"><span>When  we interview candidates, a very important part of what we do is find  out which discipline the person’s talent and passion really lie in and  directs them accordingly. Of course, over the course of one’s career,  one’s passion and talent may change, and the person may change  disciplines as a result – I myself started in the Dev discipline before  switching to PM. This is only natural and we actually encourage that as a  way to build better teams.</span></span></span></p>
<p style="margin: 0in 0in 10pt;"><b><span style="font-size:100%;"><span>Other disciplines</span></span></b></p>
<p style="margin: 0in 0in 10pt;"><span style="font-size:100%;">It  is also important to point out that although the three disciplines  mentioned above are what have traditionally been considered the “core”  disciplines at Microsoft, there are several other disciplines that are  also becoming increasingly important. For example, <b>User Experience (UX)</b>  professionals are essential to ensuring that products are intuitive and  natural for users to use. A great user experience can make the  difference a product that customers love versus one they merely  tolerate. UX is certainly very important for products aimed at end  consumers, but it is also important for all our audiences – Developers,  IT Professionals, Information Workers. </span></p>
<p style="margin: 0in 0in 10pt;"><span style="font-size:100%;">As  we move into the Software+Services era, a variety of disciplines  related to architecting, building and running extremely large-scale  infrastructure becomes increasingly important. Again, while this has  been true for some time for our consumer facing web properties such as  MSN and Live, it is now becoming increasingly important for <i>all</i> our product groups as more and more of them take steps to evolve their products along the Software+Services model.</span></p>
<p style="margin: 0in 0in 10pt;"><span style="font-size:100%;">Many  candidates I talk to often want to discuss what role at Microsoft would  be the best fit for them and how they can grow their careers. The best  advice I can think of is to<span style="font-weight: bold;"> work on a technology and a role that they  are really passionate about.</span><br /></span></p>
<p style="margin: 0in 0in 10pt;"><span style="font-size:100%;">As I mentioned above, we value all the  disciplines equally and a well-balanced organization needs great people  in all the different roles. While different disciplines appeal to people  with different passions and skill-sets, <u>all</u> the disciplines  offer opportunities for innovation and great work. And all of them offer  opportunities for advancement and leadership. Indeed if you look across  the senior levels of Microsoft, there are leaders who emerged from  various disciplines – what they shared was a passion for what the work  they were doing.</span></p>
<p style="margin: 0in 0in 10pt;"><span style="font-size:100%;">I  hope this discussion of the different engineering disciplines at  Microsoft and the approach we take to them shall be useful for the many  people who seem to be interested in this topic. If you have any  questions or comments, feel free to post a reply to his entry.&#8221;</span></p>
</p></div>
<div>        </div>
]]></content:encoded>
			<wfw:commentRss>http://csliu.com/2010/09/disciplines-in-microsoft-engineering-team/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Relevance Measuring in Information Retrieval System</title>
		<link>http://csliu.com/2010/09/relevance-measuring-in-information-retrieval-system/</link>
		<comments>http://csliu.com/2010/09/relevance-measuring-in-information-retrieval-system/#comments</comments>
		<pubDate>Sun, 05 Sep 2010 09:47:00 +0000</pubDate>
		<dc:creator>csliu</dc:creator>
				<category><![CDATA[InfoRetrieval]]></category>

		<guid isPermaLink="false">http://csliu.com/?p=97</guid>
		<description><![CDATA[<p>One of the challenges an Information Retrieval (IR) system faces is relevance quality. It&#8217;s one of the three main factors that determine end users&#8217; happiness (the other two are latency and corpus size).</p> <p>To design and implement an IR system with high relevance quality, we must have some way to measure that quality.</p> <p>Generally speaking, a [...]]]></description>
			<content:encoded><![CDATA[<p>One of the challenges an Information Retrieval (IR) system faces is relevance quality. It&#8217;s one of the three main factors that determine end users&#8217; happiness (the other two are latency and corpus size).</p>
<p>To design and implement an IR system with high relevance quality, we must have some way to measure that quality.</p>
<p>Generally speaking, a measuring system consists of three components:<br />- Test corpus (the document collection used for testing)<br />- Test query set (the set of queries used for testing)<br />- Measuring function (a function that scores the result an IR system returns for a query in the query set, run against the test corpus)</p>
<p>Building the test corpus and query set is another story; here we focus only on the measuring function.</p>
<p>1. <span style="font-weight: bold;">Precision/Recall</span> for un-ranked retrieval results</p>
<p>Precision = #Relevant Documents Retrieved / #Retrieved Documents. It&#8217;s the percentage of the returned documents that are really relevant to the user query. (查准率)</p>
<p>Recall = #Relevant Documents Retrieved / #Total Relevant Documents. It&#8217;s the percentage of the relevant documents in the corpus that are retrieved in the query result. (查全率)</p>
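<p>The two ratios above can be sketched in a few lines of Python. The function name and the document IDs are made up for illustration; the only assumption is that both the retrieved result and the relevance judgments are available as collections of document IDs.</p>

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for an un-ranked result set."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents returned, 3 of them relevant,
# out of 6 relevant documents in the whole corpus.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d7"],
    relevant=["d1", "d2", "d3", "d4", "d5", "d6"],
)
print(p, r)  # 0.75 0.5
```

<p>Returning 0.0 for an empty result or an empty judgment set is a convention chosen for this sketch, not something the definitions prescribe.</p>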
<p>2. <span style="font-weight: bold;">NDCG</span> for ranked retrieval results</p>
<p><span style="font-weight: bold;">NDCG</span> stands for <span style="font-weight: bold;">N</span>ormalized <span style="font-weight: bold;">D</span>iscounted <span style="font-weight: bold;">C</span>umulative <span style="font-weight: bold;">G</span>ain, a measure based on human ratings.</p>
<p><span style="font-weight: bold;">Gain</span> &#8211; a rater assigns a numeric value (the score gained) that represents how good a returned document is for a specific query.</p>
<p><span style="font-weight: bold;">Cumulative Gain</span> &#8211; a gain value is assigned to each document in the top p returned results, individually and independently of its position, and the values are summed.
<dl>
<dd><img style="width: 125px; height: 56px;" alt=" \mathrm{CG_{p}} = \sum_{i=1}^{p} rel_{i} " src="http://upload.wikimedia.org/math/b/1/8/b18be8f1425b5f39a41cd40b58fbd6a3.png" /></dd>
</dl>
<p><span style="font-weight: bold;">Discounted Cumulative Gain</span> &#8211; each document&#8217;s gain is discounted by a weight that depends on its position in the retrieval result, so a relevant document counts for less the further down the list it appears.
<dl>
<dd><img style="width: 214px; height: 55px;" alt=" \mathrm{DCG_{p}} = rel_{1} + \sum_{i=2}^{p} \frac{rel_{i}}{\log_{2}i} " src="http://upload.wikimedia.org/math/d/f/5/df5560ab4a13fbe3a4bd3b07b61f7d75.png" /></dd>
</dl>
<p><span style="font-weight: bold;">Normalized Discounted Cumulative Gain</span> &#8211; the DCG is scaled so that the final value falls in [0, 1]. Usually the DCG score of the ideally ordered document list (ordered by decreasing gain score), written IDCG, is used as the normalizing factor. So
<dl>
<dd><img style="width: 165px; height: 48px;" alt=" \mathrm{nDCG_{p}} = \frac{DCG_{p}}{IDCG_{p}} " src="http://upload.wikimedia.org/math/c/6/6/c66e4f0c861568bbcdaa22b86446b8a0.png" /></dd>
</dl>
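<p>As a rough sketch, the three formulas above translate to Python as follows. The gain values are hypothetical ratings on a 0-3 scale, and the ideal ordering here is simply the rated list sorted by decreasing gain, which is an assumption of this sketch.</p>

```python
import math


def dcg(gains):
    """DCG_p = rel_1 + sum_{i=2..p} rel_i / log2(i), with ranks starting at 1."""
    total = 0.0
    for rank, gain in enumerate(gains, start=1):
        total += gain if rank == 1 else gain / math.log2(rank)
    return total


def ndcg(gains):
    """DCG of the list divided by the DCG of the same gains sorted descending."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0


# Hypothetical gains for the top 6 results of one query, in returned order.
gains = [3, 2, 3, 0, 1, 2]
print(round(dcg(gains), 3))   # 8.097
print(round(ndcg(gains), 3))  # 0.932
```

<p>Swapping the second and third results would put the gains closer to the ideal order and raise the score, which is the behaviour the position discount is meant to reward.</p>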
<p>For a concrete example of how to compute the NDCG value of a query result, please see the <a href="http://en.wikipedia.org/wiki/Discounted_cumulative_gain#Example">wiki on NDCG</a>.</p>
<p>NDCG is widely used in today&#8217;s commercial search engine evaluation, but the problem is that if the returned documents are ordered by decreasing gain score, the NDCG value will be the maximum: 1.</p>
<p>This means that NDCG only measures the ranking algorithm of a search engine and can&#8217;t tell whether the returned documents are highly related to the user&#8217;s intention or not. But from the end user&#8217;s perspective, the perfect result is a set of highly related documents, ordered properly.</p>
<p>More technically, a typical query serving sub-system of an IR system has two phases: matching (finding highly related documents) and ranking (ordering the matched documents). NDCG may be a proper tool to measure the ranking phase, but definitely not the matching phase. So I think it is not an ideal measuring mechanism for an IR system.</p>
<p>So tuning the whole system against the NDCG score alone may not be the right direction for improving a search engine.</p>
<p>Update@07/09<br />- The ideal set used to calculate the normalization factor is the list of highest-scored documents, ordered properly, not the proper ordering of the returned documents. So the problem I mentioned above doesn&#8217;t exist.<br />- But the final effect of this measuring method depends on the test corpus, the test queries, and the predefined gain scores for each query.</p>
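<p>The difference described in the update can be seen in a small sketch. The function names and the judgments are hypothetical; the point is only that the normalizer comes from every judged document for the query, not just from the documents the engine returned.</p>

```python
import math


def dcg(gains):
    """DCG in the rel_1 + sum(rel_i / log2(i)) form used in this post."""
    return sum(g if i == 1 else g / math.log2(i)
               for i, g in enumerate(gains, start=1))


def ndcg_at_k(returned_gains, all_judged_gains, k):
    """Normalize by the best possible top-k drawn from every judged document."""
    ideal = dcg(sorted(all_judged_gains, reverse=True)[:k])
    return dcg(returned_gains[:k]) / ideal if ideal else 0.0


# Hypothetical judgments: the corpus holds documents rated 3, 3, 2, 2, 1, 0,
# but the engine matched only the weak ones and returned them in perfect order.
all_judged = [3, 3, 2, 2, 1, 0]
returned = [1, 0]

print(ndcg_at_k(returned, returned, k=2))    # 1.0, ideal built from returned docs only
print(ndcg_at_k(returned, all_judged, k=2))  # about 0.167, poor matching is penalized
```

<p>With the ideal list taken from all judged documents, a perfectly ordered but poorly matched result no longer scores 1.</p>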
]]></content:encoded>
			<wfw:commentRss>http://csliu.com/2010/09/relevance-measuring-in-information-retrieval-system/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Lecture Notes – Evolution of Google Search Engine</title>
		<link>http://csliu.com/2010/08/lecture-notes-evolution-of-google-search-engine/</link>
		<comments>http://csliu.com/2010/08/lecture-notes-evolution-of-google-search-engine/#comments</comments>
		<pubDate>Fri, 06 Aug 2010 02:39:00 +0000</pubDate>
		<dc:creator>csliu</dc:creator>
				<category><![CDATA[DistributedSystem]]></category>
		<category><![CDATA[InfoRetrieval]]></category>
		<category><![CDATA[WebArch]]></category>

		<guid isPermaLink="false">http://csliu.com/?p=96</guid>
		<description><![CDATA[<p><a href="http://research.google.com/people/jeff/index.html">Jeff Dean</a> gave a keynote, <a href="http://www.wsdm2009.org/dean_abs_bio.php">Building Large Scale Information Retrieval Systems</a>, at <a href="http://www.wsdm2009.org/">WSDM 2009</a>. It&#8217;s actually a presentation on how the Google search engine evolved over the preceding 10 years. Here are my notes on this lecture.</p> <p>Part I &#8211; Overview of Search Engine Evolution: 1999 VS 2009</p> <p>Factors to Consider when Designing [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://research.google.com/people/jeff/index.html">Jeff Dean</a> gave a keynote, <a href="http://www.wsdm2009.org/dean_abs_bio.php">Building Large Scale Information Retrieval Systems</a>, at <a href="http://www.wsdm2009.org/">WSDM 2009</a>. It&#8217;s actually a presentation on how the Google search engine evolved over the preceding 10 years. Here are my notes on this lecture.</p>
<p><span style="font-size:130%;"><span style="font-weight: bold;">Part I &#8211; Overview of Search Engine Evolution: 1999 VS 2009</span></span></p>
<p>Factors to Consider when Designing an Information Retrieval System:<br />1. Corpus Size (# docs to be indexed)<br />2. QPS (Queries Per Second)<br />3. Freshness/Update Rate<br />4. Query Latency<br />5. Complexity/Cost of Scoring/Retrieval Algorithm</p>
<p>Parameter Change 1999 -> 2009:<br />1. Corpus Size: 70M -> *B <span style="font-style: italic; font-weight: bold;">~100X</span><br />2. QPS: <span style="font-weight: bold; font-style: italic;">~1000X</span><br />3. Refresh: Months -> Minutes <span style="font-weight: bold; font-style: italic;">~10000X</span><br />4. Latency: &lt;1s -> &lt;0.2s <span style="font-weight: bold; font-style: italic;">~5X</span><br />5. Machine Scale: <span style="font-weight: bold; font-style: italic;">~1000X</span></p>
<p><span style="font-weight: bold; font-style: italic;">Consider 10x Growth when designing, Rewrite for 100x Growth!</span></p>
<p><span style="font-weight: bold;"><span style="font-size:130%;">Part II -</span> </span><span style="font-size:130%;"><span style="font-weight: bold;">Evolution of Google Search Engine</span></span></p>
<p><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://carbon.videolectures.net/2009/other/wsdm09_barcelona/dean_cblirs/wsdm09_dean_cblirs.zip.slides/wsdm09_dean_cblirs_Page_14.480.jpg"><img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 480px; height: 360px;" src="http://carbon.videolectures.net/2009/other/wsdm09_barcelona/dean_cblirs/wsdm09_dean_cblirs.zip.slides/wsdm09_dean_cblirs_Page_14.480.jpg" alt="" border="0" /></a><span style="font-weight: bold;">~1997 &#8211; Circa</span>, <span style="font-weight: bold;">Research Prototype</span></p>
<p>- Simple architecture, focused on distributing/partitioning the system<br />- Term-based vs doc-based partitioning: doc-based wins<br />- Disk-based index: docID + posting list with position attributes, byte-aligned encoding</p>
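<p>The notes above don&#8217;t say which byte-aligned encoding was used. As an illustration only, here is a sketch of variable-byte encoding, one common byte-aligned scheme, applied to the gaps between sorted docIDs. All function names are made up for this sketch, and position attributes are left out.</p>

```python
def encode_varint(n):
    """Encode a non-negative int, 7 bits per byte; a set high bit means more bytes follow."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_postings(doc_ids):
    """Store sorted docIDs as gaps so that most values fit in a single byte."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        out += encode_varint(doc_id - prev)
        prev = doc_id
    return bytes(out)


def decode_postings(data):
    """Rebuild the docID list from the gap-encoded bytes."""
    doc_ids, value, shift, prev = [], 0, 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:          # continuation bit set: keep accumulating
            shift += 7
        else:                    # last byte of this gap
            prev += value
            doc_ids.append(prev)
            value, shift = 0, 0
    return doc_ids


postings = [3, 130, 131, 70000]          # hypothetical docIDs for one term
encoded = encode_postings(postings)
print(len(encoded), decode_postings(encoded) == postings)  # 6 True
```

<p>Four docIDs that would take 16 bytes as fixed 32-bit integers fit in 6 bytes here, and every value still starts on a byte boundary.</p>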
<p><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://carbon.videolectures.net/2009/other/wsdm09_barcelona/dean_cblirs/wsdm09_dean_cblirs.zip.slides/wsdm09_dean_cblirs_Page_20.480.jpg"><img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 480px; height: 360px;" src="http://carbon.videolectures.net/2009/other/wsdm09_barcelona/dean_cblirs/wsdm09_dean_cblirs.zip.slides/wsdm09_dean_cblirs_Page_20.480.jpg" alt="" border="0" /></a><br /><span style="font-weight: bold;">~1999 &#8211; Circa, Production</span></p>
<p>- Introduced Cache<br />&#8211; hit rate is low (30~60%) because of index refreshes and long-tail queries<br />&#8211; still very beneficial: it saves a large amount of disk I/O<br />&#8211; hot terms get first priority in the cache, i.e. hot and costly requests</p>
<p>- Replicated Index Data<br />&#8211; better performance<br />&#8211; better availability</p>
<p>Summary of the late 1990&#8242;s:</p>
<p>Crawler is a simple batch system<br />- start with very few urls<br />- queue new urls as they are found<br />- stop when there are enough docs</p>
<p>Index serving on cheap machines<br />- no failure handling<br />- added record/chunk checksums</p>
<p>Index Update<br />- once/month<br />- wait for traffic to get low -> take replica offline -> do update -> start serving</p>
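<p>To illustrate why index refreshes and long-tail queries hold the hit rate down, here is a minimal result-cache sketch. The LRU policy, the class name and the <code>lookup</code> stand-in are all assumptions for illustration; the talk notes don&#8217;t describe the eviction policy that was actually used.</p>

```python
from collections import OrderedDict


class ResultCache:
    """Tiny LRU cache for query results, cleared whenever the index is refreshed."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.lookups = 0

    def get(self, query, compute):
        self.lookups += 1
        if query in self.entries:
            self.hits += 1
            self.entries.move_to_end(query)   # mark as most recently used
            return self.entries[query]
        result = compute(query)               # cache miss: go to the index
        self.entries[query] = result
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        return result

    def on_index_refresh(self):
        self.entries.clear()                  # cached results may now be stale


cache = ResultCache(capacity=2)
lookup = lambda q: "results for " + q         # hypothetical stand-in for an index lookup
for q in ["news", "news", "maps", "rare query", "news"]:
    cache.get(q, lookup)
print(cache.hits, cache.lookups)  # 1 5
```

<p>In this toy run the one-off "rare query" pushes the hot "news" entry out of the cache, so the final lookup misses. It is a small-scale version of the long-tail effect noted above.</p>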
<p><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://carbon.videolectures.net/2009/other/wsdm09_barcelona/dean_cblirs/wsdm09_dean_cblirs.zip.slides/wsdm09_dean_cblirs_Page_43.480.jpg"><img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 480px; height: 360px;" src="http://carbon.videolectures.net/2009/other/wsdm09_barcelona/dean_cblirs/wsdm09_dean_cblirs.zip.slides/wsdm09_dean_cblirs_Page_43.480.jpg" alt="" border="0" /></a><br /><span style="font-weight: bold;">~2000 &#8211; Dealing with Growth</span></p>
<p>Situation:</p>
<p>- number of docs: 50M -> 1000M<br />- ~20% query traffic increase per month<br />- Yahoo! deal</p>
<p>Solution:</p>
<p>Add machines constantly<br />- more index shards for a larger index<br />- more index replicas for more query capacity</p>
<p>And improve software constantly<br />- better disk scheduling<br />- better index encoding</p>
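<p>The "more shards for index size, more replicas for query capacity" idea can be sketched as a toy in-memory cluster. The function names, the modulo sharding rule and the random replica choice are assumptions for illustration only.</p>

```python
import random


def build_cluster(docs, num_shards, num_replicas):
    """Partition docs by docID across shards, then copy each shard num_replicas times."""
    shards = [{} for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shards[doc_id % num_shards][doc_id] = text
    # cluster[s][r] is replica r of shard s (every replica holds the same docs)
    return [[dict(shard) for _ in range(num_replicas)] for shard in shards]


def search(cluster, term, rng=random):
    """Ask one replica of every shard, then merge the partial results."""
    matches = []
    for replicas in cluster:
        replica = rng.choice(replicas)  # any replica can answer for its shard
        matches += [d for d, text in replica.items() if term in text.split()]
    return sorted(matches)


# Hypothetical four-document corpus.
docs = {1: "search engine", 2: "web search", 3: "disk index", 4: "search index"}
cluster = build_cluster(docs, num_shards=2, num_replicas=3)
print(search(cluster, "search"))  # [1, 2, 4]
```

<p>Adding shards shrinks the slice of the index each machine holds, while adding replicas lets more queries be answered in parallel for the same slice.</p>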
<p><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://carbon.videolectures.net/2009/other/wsdm09_barcelona/dean_cblirs/wsdm09_dean_cblirs.zip.slides/wsdm09_dean_cblirs_Page_49.480.jpg"><img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 480px; height: 360px;" src="http://carbon.videolectures.net/2009/other/wsdm09_barcelona/dean_cblirs/wsdm09_dean_cblirs.zip.slides/wsdm09_dean_cblirs_Page_49.480.jpg" alt="" border="0" /></a><br /><span style="font-weight: bold;">~2001 &#8211; Adding In-Memory Index</span></p>
<p>- enough machine memory to hold the whole index in memory<br />- machine role: replica -> holder of micro shards<br />- balancer: coordinator<br />- availability: replicate important docs</p>
<p><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://carbon.videolectures.net/2009/other/wsdm09_barcelona/dean_cblirs/wsdm09_dean_cblirs.zip.slides/wsdm09_dean_cblirs_Page_53.480.jpg"><img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 480px; height: 360px;" src="http://carbon.videolectures.net/2009/other/wsdm09_barcelona/dean_cblirs/wsdm09_dean_cblirs.zip.slides/wsdm09_dean_cblirs_Page_53.480.jpg" alt="" border="0" /></a><br /><span style="font-weight: bold;">~2004 &#8211; Adding Infrastructure</span></p>
<p>- Generalize tree-structured query flow<br />- Generalize balancer concept<br />- New index encoding: group varint encoding<br />- GFS appears in production</p>
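<p>Group varint encoding packs four integers behind a single tag byte that records how many bytes each value occupies. The sketch below is a minimal illustration. The function names are made up, and the exact bit layout (first value in the two highest tag bits, little-endian value bytes) is an assumption of this sketch.</p>

```python
def encode_group(values):
    """Encode exactly four uint32 values: one tag byte, then 1-4 bytes per value."""
    assert len(values) == 4
    tag = 0
    body = bytearray()
    for i, v in enumerate(values):
        assert 0 <= v < 2 ** 32
        size = max(1, (v.bit_length() + 7) // 8)   # bytes needed, 1..4
        tag |= (size - 1) << (6 - 2 * i)           # 2 tag bits per value
        body += v.to_bytes(size, "little")
    return bytes([tag]) + bytes(body)


def decode_group(data):
    """Inverse of encode_group; returns the four values and the bytes consumed."""
    tag, pos, values = data[0], 1, []
    for i in range(4):
        size = ((tag >> (6 - 2 * i)) & 0b11) + 1
        values.append(int.from_bytes(data[pos:pos + size], "little"))
        pos += size
    return values, pos


group = [1, 15, 511, 131071]
encoded = encode_group(group)
print(len(encoded), format(encoded[0], "08b"))  # 8 00000110
print(decode_group(encoded))                    # ([1, 15, 511, 131071], 8)
```

<p>Compared with a per-byte continuation bit, the decoder reads all four lengths from one tag byte, so it needs far fewer branches per value.</p>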
<p><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://carbon.videolectures.net/2009/other/wsdm09_barcelona/dean_cblirs/wsdm09_dean_cblirs.zip.slides/wsdm09_dean_cblirs_Page_64.480.jpg"><img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 480px; height: 360px;" src="http://carbon.videolectures.net/2009/other/wsdm09_barcelona/dean_cblirs/wsdm09_dean_cblirs.zip.slides/wsdm09_dean_cblirs_Page_64.480.jpg" alt="" border="0" /></a><br /><span style="font-weight: bold;">~2007 &#8211; Universal Search</span></p>
<p>- Universal search: combine results from multiple vertical corpora<br />- Realtime search: fast url finding -> crawling -> indexing -> serving cycle<br />- Experiment support: have idea -> try it on real data offline and tune -> live experiment on a small piece of traffic -> roll out and launch</p>
<p><span style="font-size:130%;"><span style="font-weight: bold;">Part III &#8211; Future Trends</span></span></p>
<p>- Cross-Language Information Retrieval<br />- ACLs in large IR systems with huge numbers of users and dynamic requirements<br />- Automatic construction of efficient IR systems (one binary for both the realtime and the regular web index, with different parameter configurations)<br />- Info extraction from semi-structured data</p>
<p><span style="font-weight: bold;">[Reference]</span></p>
<p>1. Challenges in Building Large Scale Information Retrieval Systems [<a href="http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//people/jeff/WSDM09-keynote.pdf">PDF</a>, <a href="http://www.blogger.com/Challenges%20in%20Building%20Large-Scale%20Information%20Retrieval%20Systems">Video</a>]<br />2. Notes by Another Blogger &#8211; <a href="http://www.searchenginecaffe.com/2009/02/jeffrey-dean-wsdm-keynote-building.html">http://www.searchenginecaffe.com/2009/02/jeffrey-dean-wsdm-keynote-building.html</a></p>
]]></content:encoded>
			<wfw:commentRss>http://csliu.com/2010/08/lecture-notes-evolution-of-google-search-engine/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Book Notes – The Long Tail</title>
		<link>http://csliu.com/2010/07/book-notes-the-long-tail/</link>
		<comments>http://csliu.com/2010/07/book-notes-the-long-tail/#comments</comments>
		<pubDate>Wed, 28 Jul 2010 09:48:00 +0000</pubDate>
		<dc:creator>csliu</dc:creator>
				<category><![CDATA[BookReview]]></category>
		<category><![CDATA[Business]]></category>

		<guid isPermaLink="false">http://csliu.com/?p=95</guid>
		<description><![CDATA[<p>Just a book review for the famous book about Internet economics:</p> <p><a href="http://www.amazon.com/Long-Tail-Revised-Updated-Business/dp/1401309666"></a></p> <p>The review is written in Chinese since I read the Chinese version of this book. To keep this blog English-only, I moved the book review to another place: <a href="http://daomucun.blogspot.com/2010/07/book-notes-long-tail.html" target="_blank">Book Notes &#8211; The Long Tail</a>. Please follow [...]]]></description>
			<content:encoded><![CDATA[<div>
<p>Just a book review for the famous book about Internet economics:</p>
<p><a href="http://www.amazon.com/Long-Tail-Revised-Updated-Business/dp/1401309666"><img id="BLOGGER_PHOTO_ID_5502971706669257682" class="aligncenter" src="http://4.bp.blogspot.com/_qN4XeajjPQ8/TF57nHkL09I/AAAAAAAAA1A/xvJ0tRRFmQk/s320/book_the-long-tail1.jpg" border="0" alt="" /></a></p>
<p>The review is written in Chinese since I read the Chinese version of this book. To keep this blog English-only, I moved the book review to another place: <a href="http://daomucun.blogspot.com/2010/07/book-notes-long-tail.html" target="_blank">Book Notes &#8211; The Long Tail</a>. Please follow the link if you can read Chinese.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://csliu.com/2010/07/book-notes-the-long-tail/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

