tag:blogger.com,1999:blog-112652282024-03-17T20:04:01.743-07:00techUnknownnoreply@blogger.comBlogger170125tag:blogger.com,1999:blog-11265228.post-43405180519150731962024-03-13T14:55:00.000-07:002024-03-13T14:55:03.716-07:00Postgres indices on large tables : gotchas<p>When you run a migration to create an index on Postgres, to allow queries to run, we create the index using the `CONCURRENTLY` flag.</p><p>But if the migration fails for any reason, the index will be created but unusable. Say, you were adding a UNIQUE index, and the migration fails since it encounters a duplicate. So the migration fails with something like:</p><p><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjN6gTIU0Wk4ZfJC5PuGSo9WJeEkndhvEcAGPFe5Mf4KCSflsI8FZKvtjn86NECdIW7vcFkAXntW4NlTm36RgB2TdNqFGCUe2yXx0PcHsSIf2FxCMR5MiAcWbdtQ34fc0uFE8iGVSMp80TcPZXJBrL8KlKyOEaKAP82wgceN5hfAx4RHcIrT8d4FQ/s1678/Screen%20Shot%202024-03-13%20at%202.49.33%20PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="232" data-original-width="1678" height="88" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjN6gTIU0Wk4ZfJC5PuGSo9WJeEkndhvEcAGPFe5Mf4KCSflsI8FZKvtjn86NECdIW7vcFkAXntW4NlTm36RgB2TdNqFGCUe2yXx0PcHsSIf2FxCMR5MiAcWbdtQ34fc0uFE8iGVSMp80TcPZXJBrL8KlKyOEaKAP82wgceN5hfAx4RHcIrT8d4FQ/w640-h88/Screen%20Shot%202024-03-13%20at%202.49.33%20PM.png" width="640" /></a></div><p><br /></p><p>Now imagine that your migration needed to make an existing index UNIQUE. It is usually done by first removing the index, and adding it back with the UNIQUE constraint.</p><p>Since the migration is aborted at the point of creating the new index with the constraint, now there is no index. The app will be running without an index.</p><p>If you were to check the db, you might get confused as the index *seems* to be there - it is just not operable.</p><p>To verify, run an `EXPLAIN` and you will see it does not use the index.</p><p>Now the queries that regularly use the index will be running quite slowly, as the index is not quite ready. This could add a lot of pressure to the db and make the app pretty much unusable / inoperable.</p><p><br /></p><p><br /></p><p><br /></p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-11265228.post-32958647819461050942024-01-10T14:39:00.000-08:002024-01-10T14:39:06.117-08:00Ruby lazy collections and with_index<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjNSFBlP7iXwQ997zFEdTLMCaxF_KuuNPKfm3_-RbGzCpZGCRr7hjNMhpc-glAp5WQjUKH-Z9QhUWOLhoGy7-0xPUiJy50nFGAY2TTyjHa_VTEN9bV_KmKkghxPJLXmg05nwARVtJaayViE-cO9UtAPHKzkfGQ2Y2LgKFGr7zLwczYAEn73YL6SRw" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img alt="" data-original-height="588" data-original-width="940" height="200" src="https://blogger.googleusercontent.com/img/a/AVvXsEjNSFBlP7iXwQ997zFEdTLMCaxF_KuuNPKfm3_-RbGzCpZGCRr7hjNMhpc-glAp5WQjUKH-Z9QhUWOLhoGy7-0xPUiJy50nFGAY2TTyjHa_VTEN9bV_KmKkghxPJLXmg05nwARVtJaayViE-cO9UtAPHKzkfGQ2Y2LgKFGr7zLwczYAEn73YL6SRw" width="320" /></a></div><br />The lazy collections don't quite give us a way to use `with_index` , but, we can use `each_with_index`.<p></p><p>It took me a moment to figure out the problem was with the collection being lazy.</p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-11265228.post-58307033545152090782022-08-25T11:33:00.001-07:002022-08-25T11:40:46.226-07:00Using Redis for web request counting<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjliVxPSESbGhLTMl1lK7amAZAnDBr3wg2LohY5F4Ncih5G3EnSYAozaU3u_J0jnY4N0T9jh4nPM-4w4UE3ReRAgDKPap3ZYLMQRZ3CH2FPK3Mt82XkluWr_N7d_5eE5suJ7vOY5LnMyJqUYOtZjIbqx6uTr16ir-H3EL79adtoaBfzuXygRLQ/s960/Screen%20Shot%202022-08-25%20at%2011.38.58%20AM.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="486" data-original-width="960" height="162" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjliVxPSESbGhLTMl1lK7amAZAnDBr3wg2LohY5F4Ncih5G3EnSYAozaU3u_J0jnY4N0T9jh4nPM-4w4UE3ReRAgDKPap3ZYLMQRZ3CH2FPK3Mt82XkluWr_N7d_5eE5suJ7vOY5LnMyJqUYOtZjIbqx6uTr16ir-H3EL79adtoaBfzuXygRLQ/s320/Screen%20Shot%202022-08-25%20at%2011.38.58%20AM.png" width="320" /></a></div><br /><h2 style="text-align: left;"><br /></h2><h2 style="text-align: left;">Problem Statement</h2><p>With the growth of our user base and increasing traffic to the web servers, we wanted to come up with some realtime counters to measure traffic.</p><p>We wanted the system to give us an idea of:</p><p></p><ul style="text-align: left;"><li>The total web requests the site is receiving each minute</li><li>These totals aggregated hourly and displayed for the day (24 entries)</li><li>The counts broken up by the specific request path</li><li>The number of unique visitors broken up by the minute, hour and day</li></ul><h2 style="text-align: left;">Limitations of available solutions</h2><div>We use Heroku for deploying the app. The Heroku dashboard provides us a rough idea of the request counts, but it is difficult to incorporate that into our own operational dashboards. Also Heroku doesn't break down the counts by the request path, which was important to us in understanding the source of traffic and how we could scale resources based on the source. Unique visitor counts are similarly an important metric not readily available.</div><div><br /></div><h2 style="text-align: left;">Counters help scale the app</h2><div>Breaking down the traffic by the source was important in coming up with a way to efficiently scale our service. Currently most of our application sits in a single service. So scaling the app means we add machines to this single fleet, even though only requests from one or two sources really benefit from this added capacity.</div><div><br /></div><div>We have 3 main sources of traffic:</div><div><ol style="text-align: left;"><li>Our users hit the service for various features provided from their mobile app</li><li>Our users send us their geo location from the mobile app</li><li>We receive near real-time data of our users gig work from a third party</li></ol><div>Our primary application performance goal is providing a sub second experience to users as they use our mobile app; thus mainly, we want to optimize resourcing on the backend with a focus on 1.</div></div><div><br /></div><div>However, we get so much more traffic from 2. and 3. which consume most of the server bandwidth. Keeping all three services as a single service degrades the experience for the user.</div><div><br /></div><div>Mind you, 2. and 3. do not have a real time processing requirement. While a web server is needed to accept these requests, the actual processing is handled by an asynchronous worker outside of the web server.</div><div><br /></div><div>But still, since there are so many of these web requests, for the few milliseconds each of them sit on the web server, it takes away bandwidth from the user requests.</div><div><br /></div><h2 style="text-align: left;">Why Redis?</h2><div>Redis provides the ideal data structures for counting requests with minimal overhead. (And must I say that it is fast, as in really, really fast) A few keys can be used to keep a running total for each minute per request type, then a <a href="https://redis.io/docs/data-types/sorted-sets/" target="_blank">sorted set</a> can be used to aggregate the last N minutes of requests. (For our purposes, we decided to keep 60 counts, thus providing a picture of activity for the last hour, but you can choose to keep counts for longer than that.) </div><div><br /></div><div>The same idea can be extended to measure days worth of traffic broken by the hour.</div><div><br /></div><h2 style="text-align: left;">Choice of the Sorted Set</h2><div>Why did we decide on the sorted set for aggregating the counters? Well, the sorted set allows us to have the counters sorted by time. This way, we can quickly get the list of counts for the last hour ordered from minute 1 down. To be fair, it is a bit overkill to use a set for this, as we are never going to be mutating an element (since the timestamp is always increasing), but it does suit our purposes just fine!</div><div><br /></div><div>Before going any further, let us briefly recap the salient features of the sorted set. It allows us to insert elements along with a score, and the elements are sorted in real time by the score. It <a href="https://redis.io/docs/data-types/sorted-sets/" target="_blank">scales</a> really well for even millions (or more) of elements as each insert operation takes O(log(n)) time -- much like a binary tree. While we do not need that level of scale, one can think of keeping extremely granular counts for a long period of time, which could come in handy for debugging bizzare performance problems after the fact!</div><div><br /></div><div>We can use the timestamp as the score. Redis will then always sort the set by the timestamp. This has the advantage that if you wanted to change the counter later (imagine a design where you quickly provide a rough estimate of the count, but later do a second pass to provide exact counts), you can simply add a new count to the set with the same timestamp and the position of the element will not change.</div><div><br /></div><div>The counters will need to be reset at the start of the minute. I first made the mistake of using the expiry time of the key set to 1 minute, but realized that this introduces a race at the point of aggregating the count on to the sorted set. Which is that we may be unlucky that before the aggregation, redis could have expired the key, resulting in a substantial undercount in the set. (This was a difficult bug to track down, and of course the most obvious, I had a face-palm moment as you can imagine.)</div><div><br /></div><div>There is a slight difficulty we need to work around here w.r.t the sorted set. If we keep the count as the element in the set, a count that happens to be the same as one already stored will replace the previous count (with the score modified). Since we are using the timestamp as the score, this will essentially remove the entry we had for the earlier timestamp. This is how sets work after all - it is a data structure suited for keeping unique members. But we can easily overcome this by prepending the timestamp of the count to the count and storing that as the element of the set. To read the count, we merely have to split the element by the delimiter (we used the colon here for the delimited, which is somewhat of a standard in redis for splitting keys), and use the second part.</div><div><br /></div><h2 style="text-align: left;">A look at the running counters</h2><div>A ZRANGE command can retrieve the counts like so:</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijjMv2GgHx0j6c_ySyW5LQNkLC4iXgcPkGnzdIYRiy6EdXK-pgb0h0BMF6WLZqStNno09SrRkpR24iQdndI0pUc4Md6wOSZD-WQa4m1nPTeNU4dhkiJBQFWGRUJhrOLVtbljg69qVHs70V72Dx4yoJnT8hBvhARODiO4rSL4ygqngJyPAXNP8/s1342/Screen%20Shot%202022-08-21%20at%206.59.19%20PM.png" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="900" data-original-width="1342" height="429" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijjMv2GgHx0j6c_ySyW5LQNkLC4iXgcPkGnzdIYRiy6EdXK-pgb0h0BMF6WLZqStNno09SrRkpR24iQdndI0pUc4Md6wOSZD-WQa4m1nPTeNU4dhkiJBQFWGRUJhrOLVtbljg69qVHs70V72Dx4yoJnT8hBvhARODiO4rSL4ygqngJyPAXNP8/w640-h429/Screen%20Shot%202022-08-21%20at%206.59.19%20PM.png" width="640" /></a></div><br /><div><br /></div><div><br /></div><div><br /></div><p></p><div><br /></div><div><br /></div><div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div>Counting the unique visitors is only slightly more involved. First we need to keep a regular Redis set and update it for each request. In our case, the user id is encoded in the request header, we decode it and add it to the Redis set. Now if the same user visits the site again, since we are using a set, we will not add it again, and we still have just one element in the set. This way, we can take the length of the set at any point and know how many unique visitors we have from the time we started writing to the set.</div></div><p></p><div>The only thing left to do is, create the set at the start of the time interval we need to measure the count, and reset it at the end of the time interval. So we can set this up to reset every minute for a minute by minute count of unique visitors. Then we can use the infrastructure we built above to aggregate the count over to the sorted set, so we have a running count of unique visitors for the past N minutes.</div><div><br /></div><div>(You may have a different technique for figuring out the ID of the user making the request. Once the ID is extracted, you can use a Redis set to keep track of the visit.)</div><div><br /></div><div>Here is how we see the unique visitor count changing dynamically:</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjj_SLxZ36R-gmmRu4ulSIIPvGF2ITghSjo4p6aCpWZZ02Aq8OFuXFoAvv6NK95qhV9kN2qincb0RGQ145zIS_j3qK8Ke76gjbDDR6KsNVcVJ6TgZZ72XzsgpeZHp4xpmI7FAFGnOOXfJt0VsSeal9L4NEzKT1KIsDVGO3usj64pA2Ikk18SWc/s1290/Screen%20Shot%202022-08-21%20at%207.07.19%20PM.png" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="262" data-original-width="1290" height="81" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjj_SLxZ36R-gmmRu4ulSIIPvGF2ITghSjo4p6aCpWZZ02Aq8OFuXFoAvv6NK95qhV9kN2qincb0RGQ145zIS_j3qK8Ke76gjbDDR6KsNVcVJ6TgZZ72XzsgpeZHp4xpmI7FAFGnOOXfJt0VsSeal9L4NEzKT1KIsDVGO3usj64pA2Ikk18SWc/w400-h81/Screen%20Shot%202022-08-21%20at%207.07.19%20PM.png" width="400" /></a></div><br /><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div>We can just as easily use another excellent Redis command to see all the user ids in this set. Here is a snippet in our case :</div><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6H4aUUMTEzpmu8LuP8AeaO4DuP9AOGT65vKORhk7oeTFrQfTSHJ6cI18AW1xkvNw8un5lnRdfY8WQqCMa1a2c2CCkZGrA4EVH4fEaWsV-RZaCsdpHtNbxu5FmUePLqvlk53Nl6Spg35eMfgyi8xXBRfvHrnNavgDtafouWq3bYJBgZ2olRTI/s1304/Screen%20Shot%202022-08-21%20at%207.11.38%20PM.png" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="288" data-original-width="1304" height="71" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6H4aUUMTEzpmu8LuP8AeaO4DuP9AOGT65vKORhk7oeTFrQfTSHJ6cI18AW1xkvNw8un5lnRdfY8WQqCMa1a2c2CCkZGrA4EVH4fEaWsV-RZaCsdpHtNbxu5FmUePLqvlk53Nl6Spg35eMfgyi8xXBRfvHrnNavgDtafouWq3bYJBgZ2olRTI/s320/Screen%20Shot%202022-08-21%20at%207.11.38%20PM.png" width="320" /></a></div><br /><p><br /></p><p><br /></p><h2 style="text-align: left;"><br /></h2><h2 style="text-align: left;">Implementation</h2><div style="text-align: left;"><span style="font-weight: normal;"><span style="font-family: inherit;"><span style="white-space: pre-wrap;">We implemented the counters using Ruby, with the </span><a href="https://github.com/redis/redis-rb" style="white-space: pre-wrap;" target="_blank">redis gem</a><span style="white-space: pre-wrap;"> as our client. This involves several steps:</span></span></span></div><div><ol style="text-align: left;"><li><span style="white-space: pre-wrap;"><span style="font-family: inherit;">Initializing the counters</span></span></li><li><span style="white-space: pre-wrap;"><span style="font-family: inherit;">Resetting the counters at minute, hour, day intervals</span></span></li><li><span style="white-space: pre-wrap;"><span style="font-family: inherit;">Incrementing the appropriate counters for each request</span></span></li><li><span style="white-space: pre-wrap;"><span style="font-family: inherit;">Aggregating the count onto the set</span></span></li></ol><div><span style="white-space: pre-wrap;"><span style="font-family: inherit;">The first two steps can be combined. We used a scheduler that sits within the app via the <a href="https://github.com/jjb/ruby-clock" target="_blank">ruby clock gem</a>. Heroku allows provisioning a single process that runs a scheduler based on the schedule we set via the ruby clock. This is pretty similar to how one would use <a href="https://www.geeksforgeeks.org/crontab-in-linux-with-examples/" target="_blank">cron</a> on a Unix machine to schedule a task.</span></span></div></div><div><span style="white-space: pre-wrap;"><span style="font-family: inherit;"><br /></span></span></div><div><span style="white-space: pre-wrap;"><span style="font-family: inherit;">Heroku does provide a <a href="https://devcenter.heroku.com/articles/scheduler" target="_blank">scheduler</a> as well. We did not use it as it does not have the same reliability guarantees as the ruby clock gem. I have seen cases where the Heroku scheduler does not fire and fires very late, all <a href="https://devcenter.heroku.com/articles/scheduler#known-issues-and-alternatives" target="_blank">documented</a> behaviors.</span></span></div><div><span style="white-space: pre-wrap;"><span style="font-family: inherit;"><br /></span></span></div><div><span style="white-space: pre-wrap;"><span style="font-family: inherit;">Since we use Rails for our app, we utilized its framework built on top of controllers to track request counts.</span></span></div><div><span style="white-space: pre-wrap;"><span style="font-family: inherit;"><br /></span></span></div><div><span style="font-family: inherit;"><span style="white-space: pre-wrap;">A controller encapsulates serving requests for a specific web route (think of this as having one controller for <i>yoursite.com/user/login</i> and another controller for <i>yoursite.com/reports/account</i>). Now each controller is a subclass of a class we implement called </span><span style="background-color: #1e1e1e; color: #d4d4d4; white-space: pre;">ApplicationController</span><span style="white-space: pre-wrap;"> which itself is a subclass of the Rails class </span><span style="background-color: #1e1e1e; color: #d4d4d4; white-space: pre;">ActionController::Base</span><span style="white-space: pre-wrap;">.</span></span></div><div><span style="white-space: pre-wrap;"><span style="font-family: inherit;"><br /></span></span></div><div><span style="white-space: pre-wrap;"><span style="font-family: inherit;">Rails allows us to hook all requests at the ApplicationController level with a simple <i>before_action</i> hook. We implemented the request counting using this hook, and it looks something like this:</span></span></div><div><span face="arial, sans-serif"><span style="white-space: pre-wrap;"><br /></span></span></div><div><blockquote><span style="white-space: pre-wrap;"><span style="font-family: courier;">class ApplicationController < ActionController::Base
before_action :update_usage_counters
def update_usage_counters
PerfCounters.increment(user_id: current_user&.id, request_path: request&.fullpath)
end
end
</span></span></blockquote><div style="font-family: arial, sans-serif;"><span style="white-space: pre-wrap;"><br /></span></div><div><span style="white-space: pre-wrap;"><span style="font-family: inherit;">Now each request goes through update_usage_counters, which delegates the work to the PerfCounters class we wrote. <i>request</i> is an object provided by the Rails routing framework, and <i>request.fullpath</i> contains the URL. The method <i>current_user</i> (not shown) extracts the logged in user's ID from the request headers.</span></span></div><div><span style="white-space: pre-wrap;"><span style="font-family: inherit;"><br /></span></span></div><div><span style="white-space: pre-wrap;"><span style="font-family: inherit;">I will reproduce pieces of a simplified version of <i>PerfCounters</i> that will illustrate the logic:</span></span></div><div><span style="white-space: pre-wrap;"><span style="font-family: inherit;"><br /></span></span></div><div><span style="white-space: pre-wrap;"><span style="font-family: inherit;">The incrementing logic looks like this:</span></span></div><div><blockquote><span style="white-space: pre-wrap;"><span style="font-family: courier;">class PerfCounters
def self.increment(user_id:, request_path:)
$redis.pipelined do |pipeline|
if user_id.present?
pipeline.incr('USER_REQUESTS_MINUTE')
pipeline.sadd('UNIQUE_VISITORS_MINUTE', user_id)
if request_path&.include?("/geo/send")
pipeline.incr('GEO_REQUESTS_MINUTE')
end
else
pipeline.incr('OTHER_REQUESTS_MINUTE')
end
end
end
end
</span></span></blockquote></div><div style="font-family: arial, sans-serif;"><span style="white-space: pre-wrap;"><br /></span></div></div><div><span style="white-space: pre-wrap;"><span style="font-family: inherit;">Notice that a request made on behalf of a logged in user will have <i>user_id</i> parameter set. The <i>request_path</i> is the path of the URL, and we use it here to separate the counts made to track the location of the user.</span></span></div><div><span style="white-space: pre-wrap;"><span style="font-family: inherit;"><br /></span></span></div><div><span style="white-space: pre-wrap;"><span style="font-family: inherit;">Another neat redis feature we use here is <a href="https://redis.io/docs/manual/pipelining/" target="_blank">pipelining</a>. The idea is that if we need to make a number of independent requests to redis, we can open a socket to the redis server and send all that data and close the socket at the end. Redis server will return an array of replies in order after it processes all requests. This is a powerful feature that is more efficient than creating a socket for each separate request. It is not without cost - as the server has to buffer each request, technically blocking the request thread from processing other requests. The rule of thumb is to make sure that each request is pretty fast - O(1) would be ideal, and to not pipeline too many requests in a single call. As with everything, you must test this against all the other traffic you serve and compromise if you must!</span></span></div><div><span style="white-space: pre-wrap;"><span style="font-family: inherit;"><br /></span></span></div><div><span style="font-family: inherit;"><span style="white-space: pre-wrap;">Also notice that we are demonstrating the use of three counters </span><span style="white-space: pre-wrap;">USER_REQUESTS_MINUTE, </span><span style="white-space: pre-wrap;">GEO_REQUESTS_MINUTE and </span><span style="white-space: pre-wrap;">OTHER_REQUESTS_MINUTE, alongside a set called </span><span style="white-space: pre-wrap;">UNIQUE_VISITORS_MINUTE. This last one actually keeps the user ids of all visitors. The <a href="https://redis.io/commands/sadd/" target="_blank">sadd</a> command adds the visitor id to the set upon the first time we see them.</span></span></div><div><span style="white-space: pre-wrap;"><span style="font-family: inherit;"><br /></span></span></div><div><span style="white-space: pre-wrap;"><span style="font-family: inherit;">The ruby clock gem takes its inputs via a file named <i>Clockfile</i>. This is in fact a file that uses ruby syntax, i:e it is evaluated by the ruby interpreter. All we do is define the aggregator to run every minute, like so:</span></span></div><div><span face="arial, sans-serif" style="white-space: pre-wrap;"><br /></span></div><div><blockquote><span style="white-space: pre-wrap;"><span style="font-family: courier;">schedule.cron '* * * * *' do
PerfCounters.aggregate_minute
end</span></span><span face="arial, sans-serif" style="white-space: pre-wrap;">
</span></blockquote></div><div><span style="white-space: pre-wrap;"><span style="font-family: inherit;">This is what the minute aggregation looks like:</span></span></div><div><blockquote><span style="white-space: pre-wrap;"><span style="font-family: courier;">def self.aggregate_minute
tm_obj = Time.current - 1.minute # aggregate last minute's stats
tm = tm_obj.to_i
# get all current minute counters, add them to the hour list before zeroing them out
user_rpm, other_rpm, geo_rpm, unique_visitors_last_minute =
$redis.pipelined do |pipeline|
pipeline.get('USER_REQUESTS_MINUTE')
pipeline.get('OTHER_REQUESTS_MINUTE')
pipeline.get('GEO_REQUESTS_MINUTE')
pipeline.scard('UNIQUE_VISITORS_MINUTE')
end
$redis.pipelined do |pipeline|
# ZADD key score value : keep timestamp as score so we get the counters sorted by time
# append the timestamp to the counter to make sure entries don't overwrite.
pipeline.zadd('USER_REQUESTS_LAST_HOUR', tm, "#{user_rpm}:#{tm}")
pipeline.zadd('OTHER_REQUESTS_LAST_HOUR', tm, "#{other_rpm}:#{tm}")
pipeline.zadd('GEO_REQUESTS_LAST_HOUR', tm, "#{geo_rpm}:#{tm}")
pipeline.zadd('UNIQUE_VISITORS_LAST_HOUR', tm, "#{unique_visitors_last_minute}:#{tm}")
pipeline.del('USER_REQUESTS_MINUTE')
pipeline.del('OTHER_REQUESTS_MINUTE')
pipeline.del('GEO_REQUESTS_MINUTE')
pipeline.del('UNIQUE_VISITORS_MINUTE')
end
end</span></span><span face="arial, sans-serif" style="white-space: pre-wrap;">
</span></blockquote></div><div><span style="font-family: inherit;"><span style="white-space: pre-wrap;">As you can see, there are two types of counters. One tracks the count for each minute, the other aggregates it for the hour. So for example, take </span><span style="white-space: pre-wrap;"><i>USER_REQUESTS_MINUTE</i>. This is incremented for each request made on behalf of a logged in user. Then upon the dawn of the minute, this counter is added to the sorted set </span><span style="white-space: pre-wrap;"><i>USER_REQUESTS_LAST_HOUR</i> and then immediately deleted.</span></span></div><div><span style="white-space: pre-wrap;"><span style="font-family: inherit;"><br /></span></span></div><div><span style="white-space: pre-wrap;"><span style="font-family: inherit;">You can chop the aggregations every Mth hour since otherwise these sets will keep growing eventually taking all of Redis memory! I won't show that code, but it should be fairly straightforward to implement.</span></span></div><div><span style="white-space: pre-wrap;"><span style="font-family: inherit;"><br /></span></span></div><div><span style="white-space: pre-wrap;"><span style="font-family: inherit;">After having implemented this solution and writing this article, I have come to see that there are other ways to implement counting using Redis. Redis provides such a versatile set of data structures and algorithms that there is always a simpler or more elegant technique somewhere!</span></span></div><div><span style="white-space: pre-wrap;"><span style="font-family: inherit;"><br /></span></span></div><div><span style="white-space: pre-wrap;"><span style="font-family: inherit;">For example, when we use a container like a sorted set or a list, we must set its bounds, clearing it at certain time intervals and thus restricting its memory usage. But if you use the <a href="https://redis.io/docs/stack/" target="_blank">Redis stack</a>, there is an excellent data structure - the Redis <a href="https://redis.io/docs/stack/timeseries/" target="_blank">Timeseries</a> that does much of this bookkeeping for you. Basically, you can configure the time series to expire old entries (something that most other Redis data structures do not do for you - you can expire the complete key or nothing at all). Besides that it has commands very similar to the set or the sorted set.</span></span></div><div><span style="white-space: pre-wrap;"><span style="font-family: inherit;"><br /></span></span></div><div><span style="white-space: pre-wrap;"><span style="font-family: inherit;">Another advantage of a time series vs a sorted set would be the trivially simple management of a "rolling" window of counts. This is typical in performance monitoring that you want the "last 72 hours" or the last "30 days" of performance data, which is more useful than data "for all of today" or "for the current hour".</span></span></div><div><span style="white-space: pre-wrap;"><span style="font-family: inherit;"><br /></span></span></div><div><span style="white-space: pre-wrap;"><span style="font-family: inherit;">I leave this as an exercise to the reader. Maybe I can talk about this in greater detail in a future article as well!</span></span></div><span style="font-family: inherit;"><span></span></span><div><br /></div><span><a name='more'></a></span><div><br /></div><p><span face="arial, sans-serif" style="font-size: small; white-space: pre-wrap;">This post is in collaboration with Redis.</span></p><p><span face="arial, sans-serif" style="font-size: small; font-weight: 700; white-space: pre-wrap;">Learn more:</span></p><div style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;"><span id="m_728717110910178134m_5025674836303415239gmail-m_7284779258426388204gmail-m_-3702633914473850678gmail-m_8952040696209478705gmail-m_-1813321305228359964gmail-m_1262458826033144780gmail-m_-7682981764392618003gmail-docs-internal-guid-ffb92f76-7fff-b50f-4503-7982e6d77dea"><span face="arial, sans-serif"><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><br /></p><ul style="margin-bottom: 0px; margin-top: 0px;"><li dir="ltr" style="background-color: transparent; color: black; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; margin-left: 36pt; vertical-align: baseline; white-space: pre-wrap;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><a data-saferedirecturl="https://www.google.com/url?q=https://redis.info/3NBGJRT&source=gmail&ust=1660847741884000&usg=AOvVaw3OvlJHE4walg8cPTDs6_EZ" href="https://redis.info/3NBGJRT" style="color: #1155cc; text-decoration-line: none;" target="_blank"><span style="background-color: transparent; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; vertical-align: baseline;">Try Redis Cloud for free</span></a></p></li><li dir="ltr" style="background-color: transparent; color: black; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; margin-left: 36pt; vertical-align: baseline; white-space: pre-wrap;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><a data-saferedirecturl="https://www.google.com/url?q=https://redis.info/3Ga9YII&source=gmail&ust=1660847741884000&usg=AOvVaw3WFQ5HF6OUmX0bUxd401Fk" href="https://redis.info/3Ga9YII" style="color: #1155cc; text-decoration-line: none;" target="_blank"><span style="background-color: transparent; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; vertical-align: baseline;">Watch this video on the benefits of Redis Cloud over other Redis providers</span></a></p></li><ul style="margin-bottom: 0px; margin-top: 0px;"><li dir="ltr" style="background-color: transparent; color: black; font-style: italic; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; margin-left: 36pt; vertical-align: baseline; white-space: pre-wrap;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="background-color: transparent; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;">(Embed this video if possible)</span></p></li></ul><li dir="ltr" style="background-color: transparent; color: black; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; margin-left: 36pt; vertical-align: baseline; white-space: pre-wrap;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><a data-saferedirecturl="https://www.google.com/url?q=https://redis.info/3LC4GqB&source=gmail&ust=1660847741884000&usg=AOvVaw2kSKn9ughOQAfpSuz-kRW7" href="https://redis.info/3LC4GqB" style="color: #1155cc; text-decoration-line: none;" target="_blank"><span style="background-color: transparent; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; vertical-align: baseline;">Redis Developer Hub - tools, guides, and tutorials about Redis</span></a></p></li><li dir="ltr" style="background-color: transparent; color: black; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; margin-left: 36pt; vertical-align: baseline; white-space: pre-wrap;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><a data-saferedirecturl="https://www.google.com/url?q=https://redis.info/3wMR7PR&source=gmail&ust=1660847741884000&usg=AOvVaw1yy4M9iSMCYouDeG8n7XDv" href="https://redis.info/3wMR7PR" style="color: #1155cc; text-decoration-line: none;" target="_blank"><span style="background-color: transparent; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; vertical-align: baseline;">RedisInsight Desktop GUI</span></a></p></li></ul></span></span></div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-11265228.post-44285720954612000472020-12-18T11:27:00.001-08:002020-12-18T11:27:26.555-08:00Set GIT Hash on Heroku Deploys<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-WdmWzxgYCXI/X90BvSXaCXI/AAAAAAAAD5o/i9VOIvIUh8MCtTJHYftwS0N1PPcztRPlACNcBGAsYHQ/s200/git.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="200" data-original-width="200" src="https://1.bp.blogspot.com/-WdmWzxgYCXI/X90BvSXaCXI/AAAAAAAAD5o/i9VOIvIUh8MCtTJHYftwS0N1PPcztRPlACNcBGAsYHQ/s0/git.png" /></a></div><br />There are <a href="https://stackoverflow.com/questions/14583282/heroku-display-hash-of-current-commit-in-browser" target="_blank">several ways to set the GIT hash for heroku</a>. I prefer setting this just before pushing the main branch to the heroku remote.<p></p><p>Also, I prefer to automate this with a `pre-push` hook.</p><p>One issue I ran into was that it was not straightforward to know if there is anything to push. Meaning, if the remote was up-to-date or not. This is important, as we don't want heroku deploying a version when there are no changes to be deployed (which it will if we set the APP_VERSION).</p><p>The simplest way to know if the remote was already up-to-date seemed to be to do a `--dry-run` of `git push`.</p><p>But of course that runs the hook again, so this is an endless loop situation.</p><p>It does not seem that git allows us to see the git argument on the hook. If it did, we could explicitly not run the hook on a dry run.</p><p>But there is a `--no-verify` option that bypasses the hooks, and can be used when we do our dry run.</p><p>Here is the complete script:</p><p><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/--Sr-lPL_4XA/X90BfKj5GfI/AAAAAAAAD5g/mVKn830QDFM_6it_MsL-CipKVT7dkNgwACNcBGAsYHQ/s1422/Screen%2BShot%2B2020-12-18%2Bat%2B11.22.18%2BAM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1422" data-original-width="1370" height="320" src="https://1.bp.blogspot.com/--Sr-lPL_4XA/X90BfKj5GfI/AAAAAAAAD5g/mVKn830QDFM_6it_MsL-CipKVT7dkNgwACNcBGAsYHQ/s320/Screen%2BShot%2B2020-12-18%2Bat%2B11.22.18%2BAM.png" /></a></div><br /><p><br /></p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-11265228.post-63512921191674528522019-11-30T17:47:00.001-08:002019-12-01T10:33:05.950-08:00Rails database migrations on a cluster with AWS CodeDeploy<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-iHhl_iUAbcc/XeMbn2eegMI/AAAAAAAACZ4/cFMvcba8PhwhP8vx9CQn99yjf-Ax-NTsQCNcBGAsYHQ/s1600/ror.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="200" data-original-width="200" src="https://1.bp.blogspot.com/-iHhl_iUAbcc/XeMbn2eegMI/AAAAAAAACZ4/cFMvcba8PhwhP8vx9CQn99yjf-Ax-NTsQCNcBGAsYHQ/s1600/ror.png" /></a></div>
A typical Ruby / Rails solution will comprise of a number of web servers behind a load balancer. Each of the web servers will read/write from a central database. In the course of new features being added to the application, the database schema goes through changes, what we refer to as "migrations".<br />
<br />
When new code is deployed, the migrations that are needed for new code needs to be run first. If AWS Code Deploy is used for deployment of new code, we can set up the <i>AfterInstall</i> hooks to run the migrations before re-starting the web server.<br />
<br />
So the usual flow in a deployment goes something like this:<br />
<br />
<ol>
<li>Stop the web server</li>
<li>Migrate the database</li>
<li>Start the web server</li>
</ol>
<br />
However, our application is hosted on a number of web servers. We don't want to bring down all servers at once. A typical blue/green deployment will have us deploy to just one third of the server fleet at once.<br />
<br />
So if we have 27 web servers, we will be running the above steps on 9 of them at the same time. The main problem with this is that when the Rails migrate runs on multiple servers at once, it is likely to fail on a number of them. This is because Rails takes an advisory lock on the database and throws an exception on concurrent migrations. You can read more about the advisory locking <a href="https://nebulab.it/blog/the-strange-case-of-activerecord-concurrentmigrationerror/" target="_blank">here </a>as well as a way to work around the problem.<br />
<br />
But the solution is not without its drawbacks. If you prevent the migrations running on all but one machine, it is possible that the new code will be deployed sooner on those machines before the migration has finished. This is specially true for long running migrations. Then there is potential for new code to be running against an old database schema. New features that depend on the new schema will likely fail.<br />
<br />
A better solution would be:<br />
<ol>
<li>Run the migration on a single instance - this could be one of the web servers, or a dynamically provisioned EC2 instance that can access the database.</li>
<li>For all web servers</li>
</ol>
2.1 Stop the server<br />
2.2 Deploy new code to it<br />
2.3 Start the server<br />
<ul>
</ul>
<div>
The advantage of this solution is</div>
<div>
<ol>
<li>We side-step the concurrent migration issue. We run the migrations on a single instance and then do the rest of the deployment without incurring any database issues.</li>
<li>We bring up the new code only after the database is migrated, so the new features work reliably from the start.</li>
</ol>
</div>
<div>
So the new database schema changes need to be backward compatible. But this is a general constraint we have anyway since on a blue/green deployment some part of the code is old and will be hitting the new database.</div>
<div>
<br /></div>
<div>
While this solution is pretty straightforward, it requires some effort to implement this in the AWS CodeDeploy environment.</div>
<div>
<br /></div>
<div>
What I ended up doing was to use a new Deployment Group (called <i>staging</i>) to bring up a single EC2 instance, change the start up code to only run the migration on that deployment group. Then I hooked this deployment group right after the deployment to a test instance, but before the code is deployed to the production servers.</div>
<div>
<br /></div>
<div>
In the startup code, we can check the current deployment group via <i>ENV['DEPLOYMENT_GROUP_NAME'].</i> In our scripts, we set the <i>RAILS_ENV</i> equal to the Deployment Group. This allows code to take different paths based on where it runs (in a local dev environment, a staging server or like in this case on a <i>migrator</i> server).<br />
<br />
This is what our migrate script now looks like:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-H47hHL2St_E/XeMX10cYfoI/AAAAAAAACY8/dQbcN61VJvcMKxYPNYJ9Ylj8VQmrkk4GQCNcBGAsYHQ/s1600/Screen%2BShot%2B2019-11-30%2Bat%2B5.30.44%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="186" data-original-width="756" height="78" src="https://1.bp.blogspot.com/-H47hHL2St_E/XeMX10cYfoI/AAAAAAAACY8/dQbcN61VJvcMKxYPNYJ9Ylj8VQmrkk4GQCNcBGAsYHQ/s320/Screen%2BShot%2B2019-11-30%2Bat%2B5.30.44%2BPM.png" width="320" /></a></div>
<br />
It is important to set the inequality, as we want the migrations to run on our test servers - we just don't want them running on the production web servers.<br />
<br />
We add this to our database.yml, notice the environment is <i>staging</i>, to match the deployment group. Notice the database is the production instance.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-w-EziehRAh0/XeMYcl-DBbI/AAAAAAAACZE/SxSJhVotYsgMNh4yiVPfwgjgN7s613EygCNcBGAsYHQ/s1600/Screen%2BShot%2B2019-11-30%2Bat%2B5.32.57%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="360" data-original-width="1520" height="75" src="https://1.bp.blogspot.com/-w-EziehRAh0/XeMYcl-DBbI/AAAAAAAACZE/SxSJhVotYsgMNh4yiVPfwgjgN7s613EygCNcBGAsYHQ/s320/Screen%2BShot%2B2019-11-30%2Bat%2B5.32.57%2BPM.png" width="320" /></a></div>
<br />
In our case, we read credentials from AWS Secrets Manager. You don't have to.<br />
<br />
This is how our <i>staging</i> step in CodePipleline looks like:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-2zKTt3F0QUQ/XeMaJmo4aLI/AAAAAAAACZk/s5lWPGnTMYANJeTqdWMw-_QWIH_8R0YZQCNcBGAsYHQ/s1600/Screen%2BShot%2B2019-11-30%2Bat%2B5.40.12%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1154" data-original-width="1196" height="308" src="https://1.bp.blogspot.com/-2zKTt3F0QUQ/XeMaJmo4aLI/AAAAAAAACZk/s5lWPGnTMYANJeTqdWMw-_QWIH_8R0YZQCNcBGAsYHQ/s320/Screen%2BShot%2B2019-11-30%2Bat%2B5.40.12%2BPM.png" width="320" /></a></div>
<br />
On the last CodeDeploy step, hit the edit button and set the Application Name and the Deployment Group correctly.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-6duEk7oczgM/XeMbAdvkyYI/AAAAAAAACZs/qSVlQ2uCPZIOgBioAnFLgC0FTUz3v5veQCNcBGAsYHQ/s1600/Screen%2BShot%2B2019-11-30%2Bat%2B5.42.44%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="223" data-original-width="1600" height="44" src="https://1.bp.blogspot.com/-6duEk7oczgM/XeMbAdvkyYI/AAAAAAAACZs/qSVlQ2uCPZIOgBioAnFLgC0FTUz3v5veQCNcBGAsYHQ/s320/Screen%2BShot%2B2019-11-30%2Bat%2B5.42.44%2BPM.png" width="320" /></a></div>
<br /></div>
<div>
<br />
Now before the code is deployed to the production servers, the database migration has been completed on the <i>staging</i> instance. If the migration fails, CodeDeploy won't advance to the deploy stage. When the production servers start with new code, they will all use the new database schema.<br />
<br />
After the migration has finished and before code is deployed, the old code will start using the new database schema. As long as the new schema is backward compatible, this will not cause a problem.<br />
<br />
You may have to run the release pipline a few times till AWS co-operates with the new changes. But it should eventually start working.</div>
<div>
</div>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-11265228.post-33696693194344146762019-07-08T23:33:00.003-07:002019-07-08T23:45:48.510-07:00Kinesis/Firehose/Athena - Creating queryable data from logs<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-hN5024Mg2xo/XSQ1lFe4RgI/AAAAAAAACIE/Im8EPT2eB3wdbXYuaWexCGf-ZeRPEtjFwCLcBGAs/s1600/firehose.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="500" data-original-width="1000" height="160" src="https://1.bp.blogspot.com/-hN5024Mg2xo/XSQ1lFe4RgI/AAAAAAAACIE/Im8EPT2eB3wdbXYuaWexCGf-ZeRPEtjFwCLcBGAs/s320/firehose.jpg" width="320" /></a></div>
Amazon has a data transformation pipeline that allows log data to be queried with a SQL like syntax. This can be useful to gain insight that is buried in log data generally thought of as temporary. When was the last time you went over 6 months of logs? Right, just what I thought.<br />
<div>
<br /></div>
<div>
Wading through logs is painful and with the growth of data all the more so. No wonder that when confronted with the task of gleaning information from past data, engineers build specialized table structures with relational queries in mind or provision specialized map/reduce jobs to crunch over the data for detailed answers to specific questions.</div>
<div>
<br /></div>
<div>
But this time consuming exercise can be done away with by using the Amazon Kinesis pipeline. The flow looks something like this -> The application writes a JSON formatted record that captures a particular item of interest to a Kinesis data stream. A Firehose instance is attached to the output of the data stream. This Firehose instance converts the data to a "JSON like" format and writes them into a S3 bucket at a specified folder. Another Amazon service Glue provides a crawler that can then process new files that get uploaded to the S3 bucket. The Glue crawler infers the schema from the JSON it finds in the S3 files and creates a Glue table. To query the data, Amazon provides yet another service - Athena, which sports a SQL syntax and a user friendly query console. Phew, yeah, it is quite the mother of all pipelines.</div>
<div>
<br /></div>
<div>
This is all pretty straightforward to set up starting from Kinesis console itself. You should start with the Data streams tab in Kinesis, create a data stream, then create a Kinesis Firehose with source equal to the data stream you just created. Specify that firehose data will be written with the API like so:</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-UaEikOuyqkc/XSQpqCKxS2I/AAAAAAAACGI/AJowAvEVnKQuoo6IZA_tXmoJ3WP216l8QCLcBGAs/s1600/Screen%2BShot%2B2019-07-08%2Bat%2B10.43.24%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="198" data-original-width="1600" height="48" src="https://1.bp.blogspot.com/-UaEikOuyqkc/XSQpqCKxS2I/AAAAAAAACGI/AJowAvEVnKQuoo6IZA_tXmoJ3WP216l8QCLcBGAs/s400/Screen%2BShot%2B2019-07-08%2Bat%2B10.43.24%2BPM.png" width="400" /></a></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
Since we are writing JSON to Kinesis, there is no need to convert record format, and we will use the data as is without transformation to Firehose, well more on this later, but we can leave the default settings for <span style="background-color: white; color: #444444; font-family: "helvetica neue" , "roboto" , "arial" , sans-serif; font-size: 14px; font-weight: 700; text-align: right;">Source record transformation</span> and <span style="background-color: white; color: #444444; font-family: "helvetica neue" , "roboto" , "arial" , sans-serif; font-size: 14px; font-weight: 700; text-align: right;">Record format conversion</span>. </div>
<div>
<br /></div>
<div>
Finally you need to specify where you want this data to live, in our case S3:</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-V3M5YKlSfFY/XSQrRcRbkqI/AAAAAAAACGU/UpFNLpr05VA5O45ieHyJtKH2GzM1Un5gwCLcBGAs/s1600/Screen%2BShot%2B2019-07-08%2Bat%2B10.50.06%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="325" data-original-width="1600" height="81" src="https://1.bp.blogspot.com/-V3M5YKlSfFY/XSQrRcRbkqI/AAAAAAAACGU/UpFNLpr05VA5O45ieHyJtKH2GzM1Un5gwCLcBGAs/s400/Screen%2BShot%2B2019-07-08%2Bat%2B10.50.06%2BPM.png" width="400" /></a></div>
<div>
<br /></div>
<div>
Now head over to Glue and add a crawler. Specify the S3 folder that you used above for the "include path". For simplicity, I would start with a single schema for all S3 records, under "Grouping Behaviors".</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-VhvsVziySZw/XSQsQW7dfNI/AAAAAAAACGs/Ej6-MW-MAW0lIjlnYPqHFthE_r3DDG77wCLcBGAs/s1600/Screen%2BShot%2B2019-07-08%2Bat%2B10.54.05%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="312" data-original-width="1600" height="77" src="https://1.bp.blogspot.com/-VhvsVziySZw/XSQsQW7dfNI/AAAAAAAACGs/Ej6-MW-MAW0lIjlnYPqHFthE_r3DDG77wCLcBGAs/s400/Screen%2BShot%2B2019-07-08%2Bat%2B10.54.05%2BPM.png" width="400" /></a></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
Now head over to your favorite editor and let's write some code - finally!</div>
<div>
It's up to you how you want to structure the code to do this. In the application I'm building, it is literally the logs that I want sent over to Kinesis. Anticipating this, I wrote a function that the app calls for writing logs, and this function was the ideal place to add in the write to Kinesis. It looks something like this:</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-DcaeZEuESqc/XSQt5O_gPjI/AAAAAAAACG4/yxBDDhTiqLMF84RwoqTD9E2FMkYvmtseACLcBGAs/s1600/Screen%2BShot%2B2019-07-08%2Bat%2B11.01.57%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="562" data-original-width="1600" height="224" src="https://1.bp.blogspot.com/-DcaeZEuESqc/XSQt5O_gPjI/AAAAAAAACG4/yxBDDhTiqLMF84RwoqTD9E2FMkYvmtseACLcBGAs/s640/Screen%2BShot%2B2019-07-08%2Bat%2B11.01.57%2BPM.png" width="640" /></a></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
That will be all to it, except there is an annoying<a href="https://forums.aws.amazon.com/thread.jspa?threadID=244858" target="_blank"> bug in the pipeline that we need to work around</a>. The issue is that Firehose writes "JSON like" data to S3 that is all a single line. The Glue crawler expects each record to exist in a single line. So when all the records are squished into a single line in S3, the crawler processes the first and throws away the rest. Imagine my surprise when only 1 out of 17 of my log records appeared in the Athena queries.</div>
<div>
<br /></div>
<div>
The workaround is to write a Lambda function with a Kinesis trigger. What this does is that every time a Kinesis record is written, the Lambda gets triggered. Well, that is not strictly true - Kinesis will batch a bunch of records and invoke the lambda once per batch. The batch size (or time for trigger) can be specified from the console.</div>
<div>
<br /></div>
<div>
Or if you are using <a href="https://serverless.com/framework/docs/providers/aws/guide/quick-start/" target="_blank">serverless</a>, this can be specified in the serverless.yml like so:</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-hu9m6DvVY8o/XSQyYernAGI/AAAAAAAACHc/QAMboiWEwJ490Qh4ARCZz6nQwcr4iab3QCLcBGAs/s1600/Screen%2BShot%2B2019-07-08%2Bat%2B11.21.08%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="370" data-original-width="1004" height="234" src="https://1.bp.blogspot.com/-hu9m6DvVY8o/XSQyYernAGI/AAAAAAAACHc/QAMboiWEwJ490Qh4ARCZz6nQwcr4iab3QCLcBGAs/s640/Screen%2BShot%2B2019-07-08%2Bat%2B11.21.08%2BPM.png" width="640" /></a></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
Without further ado, here's the Lambda that adds the newline:</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-0lXzc0zvZBc/XSQzOZui5LI/AAAAAAAACHs/2eKbmMBGzHINT3Pt6OoU8LJdVurlOhScgCLcBGAs/s1600/Screen%2BShot%2B2019-07-08%2Bat%2B11.24.38%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1282" data-original-width="1008" height="640" src="https://1.bp.blogspot.com/-0lXzc0zvZBc/XSQzOZui5LI/AAAAAAAACHs/2eKbmMBGzHINT3Pt6OoU8LJdVurlOhScgCLcBGAs/s640/Screen%2BShot%2B2019-07-08%2Bat%2B11.24.38%2BPM.png" width="502" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div>
<br /></div>
<div>
This is written in node.js, and I used the serverless framework with the node.js template to write it. I'm exporting a single function named newlines. This is triggered when there is a batch of records in the Kinesis data stream. We map over the records, transforming each record by adding a new line. This is done in the add_new_line function.</div>
<div>
<br /></div>
<div>
To let the node engine know what we did, we use the <i>callback</i>. It is standard node.js to pass an error object for errors and null when there are no errors (we succeeded).</div>
<div>
<br /></div>
<div>
<i>firehose.putBatchRecord</i> is for efficiency - we could just as well have used <i>firehose.PutRecord</i> and the results would be the same besides throughput.</div>
Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-11265228.post-61973557686049412372018-06-04T15:26:00.000-07:002018-06-04T15:26:36.200-07:00JSON to objects in a few languages<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-GL9s7r-Nq0I/WxW6Jse8lDI/AAAAAAAABq0/stSlJsLFllAGlt96ERLJo16AgADkzre8ACLcBGAs/s1600/json.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="120" data-original-width="120" height="200" src="https://2.bp.blogspot.com/-GL9s7r-Nq0I/WxW6Jse8lDI/AAAAAAAABq0/stSlJsLFllAGlt96ERLJo16AgADkzre8ACLcBGAs/s200/json.png" width="200" /></a></div>
When working with data services, we often have a need to convert JSON strings to objects that model our data. Here is a list of code snippets in different languages that convert this <a href="https://developers.facebook.com/docs/graph-api/using-graph-api/">Facebook Graph JSON data</a>.<br />
<br />
The list is presented in ascending order of verbosity. Predictably, Javascript most succinctly expresses its wishes, whereas Java uses a copious amount of code. Scala avoids some of that verbosity by using case classes that remove boilerplate code for constructors. Jackson (Java) requires getters and setters to identify which attributes of the object are to be serialized, causing code bloat.<br />
<br />
JSON:<br />
<script src="https://gist.github.com/thushw/e0415703669d53b7ba8527e96d636067.js"></script> Javascript:<br />
<script src="https://gist.github.com/thushw/34a6ebdaa71ee9b493442f6bb97a0684.js"></script> Ruby:<br />
<script src="https://gist.github.com/thushw/40e425e31c83a04718b0e90e936a048f.js"></script> Python:<br />
<script src="https://gist.github.com/thushw/c29003865097dfa8f9b14d56e2edd0ef.js"></script> Scala:<br />
<script src="https://gist.github.com/thushw/407303dd34bca921132a70eb5bb45ca3.js"></script> Java:<br />
<script src="https://gist.github.com/thushw/ff665100f2577e13e4a43a71e2e17b8b.js"></script>
Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-11265228.post-67682960408984383132018-04-20T13:10:00.000-07:002018-04-20T17:24:17.791-07:00Goldbach ConjectureIn the 18th century, two mathematicians came up with a conjecture - known by its original creator - named Goldbach conjecture. It says that any even number greater than 2 can be expressed as a sum of two primes. There is no theoretical proof for this yet, but it is said to <a href="https://artofproblemsolving.com/wiki/index.php?title=Goldbach_Conjecture" target="_blank">hold up to 400 trillion</a>.<br />
<br />
<br />
A program to test Golbach conjecture for a given integer:<br />
<br />
<script src="https://gist.github.com/thushw/fdd2248c5f1c03d579d51e2c923f8241.js"></script> This program demonstrates two algorithms that are well known.<br />
<br />
<br />
<ol>
<li>The <a href="https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes">sieve of Eratosthenes</a> to calculate all primes upto a given number </li>
<li>A linear algorithm to find if two numbers in a list sum to a given number.</li>
</ol>
<br />
<br />
To prove the Goldbach conjecture for a given <i>n</i>, we use the sieve to find all prime numbers up to <i>n</i>, then use the linear algorithm to find two primes from this list that sums up to <i>n</i>.<br />
<br />
<br />Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-11265228.post-50865176357826991362018-04-06T15:58:00.001-07:002018-04-06T16:02:54.924-07:00Timing with Jupyter notebookPieces of code can be timed within the Jupyter notebook using the <a href="http://pynash.org/2013/03/06/timing-and-profiling/">%timeit magic</a>.<br />
<br />
Here is an example where a grid walk algorithm is implemented three times with progressively better run time, timed with %timeit and graphed using bokeh:<br />
<br />
Code:<br />
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<table><tbody>
<tr><td><pre style="line-height: 125%; margin: 0;"> 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47</pre>
</td><td><pre style="line-height: 125%; margin: 0;"><span style="color: #008800; font-weight: bold;">def</span> <span style="color: #0066bb; font-weight: bold;">num_paths</span>(n):
M <span style="color: #333333;">=</span> [[<span style="color: #0000dd; font-weight: bold;">0</span>] <span style="color: #333333;">*</span> n <span style="color: #008800; font-weight: bold;">for</span> i <span style="color: black; font-weight: bold;">in</span> <span style="color: #007020;">range</span>(n)]
<span style="color: #008800; font-weight: bold;">for</span> i <span style="color: black; font-weight: bold;">in</span> <span style="color: #007020;">range</span>(n):
M[n<span style="color: #333333;">-</span><span style="color: #0000dd; font-weight: bold;">1</span>][i] <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">1</span>
<span style="color: #008800; font-weight: bold;">for</span> r <span style="color: black; font-weight: bold;">in</span> <span style="color: #007020;">range</span>(n<span style="color: #333333;">-</span><span style="color: #0000dd; font-weight: bold;">2</span>, <span style="color: #333333;">-</span><span style="color: #0000dd; font-weight: bold;">1</span>, <span style="color: #333333;">-</span><span style="color: #0000dd; font-weight: bold;">1</span>):
<span style="color: #008800; font-weight: bold;">for</span> c <span style="color: black; font-weight: bold;">in</span> <span style="color: #007020;">range</span>(n<span style="color: #333333;">-</span>r<span style="color: #333333;">-</span><span style="color: #0000dd; font-weight: bold;">1</span>, n):
M[r][c] <span style="color: #333333;">=</span> M[r][c<span style="color: #333333;">-</span><span style="color: #0000dd; font-weight: bold;">1</span>] <span style="color: #333333;">+</span> M[r<span style="color: #333333;">+</span><span style="color: #0000dd; font-weight: bold;">1</span>][c]
<span style="color: #008800; font-weight: bold;">return</span> M[<span style="color: #0000dd; font-weight: bold;">0</span>][n<span style="color: #333333;">-</span><span style="color: #0000dd; font-weight: bold;">1</span>]
<span style="color: #008800; font-weight: bold;">def</span> <span style="color: #0066bb; font-weight: bold;">num_paths_from</span>(r, c, n, M):
<span style="color: #008800; font-weight: bold;">if</span> M[r][c] <span style="color: #333333;">></span> <span style="color: #0000dd; font-weight: bold;">0</span>:
<span style="color: #008800; font-weight: bold;">return</span> M[r][c]
<span style="color: #008800; font-weight: bold;">if</span> r <span style="color: #333333;">==</span> <span style="color: #0000dd; font-weight: bold;">0</span> <span style="color: black; font-weight: bold;">and</span> c <span style="color: #333333;">==</span> n<span style="color: #333333;">-</span><span style="color: #0000dd; font-weight: bold;">1</span>:
<span style="color: #008800; font-weight: bold;">return</span> <span style="color: #0000dd; font-weight: bold;">1</span>
paths <span style="color: #333333;">=</span> ([(x,y) <span style="color: #008800; font-weight: bold;">for</span> (x,y) <span style="color: black; font-weight: bold;">in</span>
[(r<span style="color: #333333;">-</span><span style="color: #0000dd; font-weight: bold;">1</span>, c), (r, c<span style="color: #333333;">+</span><span style="color: #0000dd; font-weight: bold;">1</span>)] <span style="color: #008800; font-weight: bold;">if</span> y <span style="color: #333333;">>=</span> n<span style="color: #333333;">-</span>x<span style="color: #333333;">-</span><span style="color: #0000dd; font-weight: bold;">1</span>
<span style="color: black; font-weight: bold;">and</span> y<span style="color: #333333;"><</span>n])
npaths <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>
<span style="color: #008800; font-weight: bold;">for</span> x,y <span style="color: black; font-weight: bold;">in</span> paths:
npaths <span style="color: #333333;">+=</span> num_paths_from(x,y,n,M)
M[r][c] <span style="color: #333333;">=</span> npaths
<span style="color: #008800; font-weight: bold;">return</span> npaths
<span style="color: #008800; font-weight: bold;">def</span> <span style="color: #0066bb; font-weight: bold;">num_pathz_from</span>(r, c, n):
<span style="color: #008800; font-weight: bold;">if</span> r <span style="color: #333333;">==</span> <span style="color: #0000dd; font-weight: bold;">0</span> <span style="color: black; font-weight: bold;">and</span> c <span style="color: #333333;">==</span> n<span style="color: #333333;">-</span><span style="color: #0000dd; font-weight: bold;">1</span>:
<span style="color: #008800; font-weight: bold;">return</span> <span style="color: #0000dd; font-weight: bold;">1</span>
paths <span style="color: #333333;">=</span> ([(x,y) <span style="color: #008800; font-weight: bold;">for</span> (x,y) <span style="color: black; font-weight: bold;">in</span>
[(r<span style="color: #333333;">-</span><span style="color: #0000dd; font-weight: bold;">1</span>, c), (r, c<span style="color: #333333;">+</span><span style="color: #0000dd; font-weight: bold;">1</span>)] <span style="color: #008800; font-weight: bold;">if</span> y <span style="color: #333333;">>=</span> n<span style="color: #333333;">-</span>x<span style="color: #333333;">-</span><span style="color: #0000dd; font-weight: bold;">1</span>
<span style="color: black; font-weight: bold;">and</span> y<span style="color: #333333;"><</span>n])
npaths <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>
<span style="color: #008800; font-weight: bold;">for</span> x,y <span style="color: black; font-weight: bold;">in</span> paths:
npaths <span style="color: #333333;">+=</span> num_pathz_from(x,y,n)
<span style="color: #008800; font-weight: bold;">return</span> npaths
<span style="color: #008800; font-weight: bold;">def</span> <span style="color: #0066bb; font-weight: bold;">num_paths_slow</span>(n):
M <span style="color: #333333;">=</span> [[<span style="color: #0000dd; font-weight: bold;">0</span>] <span style="color: #333333;">*</span> n <span style="color: #008800; font-weight: bold;">for</span> i <span style="color: black; font-weight: bold;">in</span> <span style="color: #007020;">range</span>(n)]
<span style="color: #008800; font-weight: bold;">return</span> num_paths_from(n<span style="color: #333333;">-</span><span style="color: #0000dd; font-weight: bold;">1</span>, <span style="color: #0000dd; font-weight: bold;">0</span>, n, M)
<span style="color: #008800; font-weight: bold;">def</span> <span style="color: #0066bb; font-weight: bold;">num_paths_super_slow</span>(n):
<span style="color: #008800; font-weight: bold;">return</span> num_pathz_from(n<span style="color: #333333;">-</span><span style="color: #0000dd; font-weight: bold;">1</span>, <span style="color: #0000dd; font-weight: bold;">0</span>, n)
<span style="color: #008800; font-weight: bold;">for</span> sz <span style="color: black; font-weight: bold;">in</span> <span style="color: #007020;">range</span>(<span style="color: #0000dd; font-weight: bold;">5</span>,<span style="color: #0000dd; font-weight: bold;">15</span>):
<span style="color: #333333;">%</span>timeit num_paths(sz)
<span style="color: #333333;">%</span>timeit num_paths_slow(sz)
<span style="color: #333333;">%</span>timeit num_paths_super_slow(sz)
</pre>
</td></tr>
</tbody></table>
</div>
<br />
Timing:<br />
<br />
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<table><tbody>
<tr><td><pre style="line-height: 125%; margin: 0;"> 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30</pre>
</td><td><pre style="line-height: 125%; margin: 0;"><span style="color: #0000dd; font-weight: bold;">100000</span> loops, best of <span style="color: #0000dd; font-weight: bold;">3</span>: <span style="color: #6600ee; font-weight: bold;">7.74</span> µs per loop
<span style="color: #0000dd; font-weight: bold;">10000</span> loops, best of <span style="color: #0000dd; font-weight: bold;">3</span>: <span style="color: #6600ee; font-weight: bold;">26.2</span> µs per loop
<span style="color: #0000dd; font-weight: bold;">10000</span> loops, best of <span style="color: #0000dd; font-weight: bold;">3</span>: <span style="color: #6600ee; font-weight: bold;">62.1</span> µs per loop
<span style="color: #0000dd; font-weight: bold;">100000</span> loops, best of <span style="color: #0000dd; font-weight: bold;">3</span>: <span style="color: #6600ee; font-weight: bold;">9.27</span> µs per loop
<span style="color: #0000dd; font-weight: bold;">10000</span> loops, best of <span style="color: #0000dd; font-weight: bold;">3</span>: <span style="color: #6600ee; font-weight: bold;">32.9</span> µs per loop
<span style="color: #0000dd; font-weight: bold;">10000</span> loops, best of <span style="color: #0000dd; font-weight: bold;">3</span>: <span style="color: #0000dd; font-weight: bold;">200</span> µs per loop
<span style="color: #0000dd; font-weight: bold;">100000</span> loops, best of <span style="color: #0000dd; font-weight: bold;">3</span>: <span style="color: #6600ee; font-weight: bold;">11.3</span> µs per loop
<span style="color: #0000dd; font-weight: bold;">10000</span> loops, best of <span style="color: #0000dd; font-weight: bold;">3</span>: <span style="color: #0000dd; font-weight: bold;">43</span> µs per loop
<span style="color: #0000dd; font-weight: bold;">1000</span> loops, best of <span style="color: #0000dd; font-weight: bold;">3</span>: <span style="color: #0000dd; font-weight: bold;">615</span> µs per loop
<span style="color: #0000dd; font-weight: bold;">100000</span> loops, best of <span style="color: #0000dd; font-weight: bold;">3</span>: <span style="color: #6600ee; font-weight: bold;">13.9</span> µs per loop
<span style="color: #0000dd; font-weight: bold;">10000</span> loops, best of <span style="color: #0000dd; font-weight: bold;">3</span>: <span style="color: #6600ee; font-weight: bold;">56.9</span> µs per loop
<span style="color: #0000dd; font-weight: bold;">100</span> loops, best of <span style="color: #0000dd; font-weight: bold;">3</span>: <span style="color: #6600ee; font-weight: bold;">2.05</span> ms per loop
<span style="color: #0000dd; font-weight: bold;">100000</span> loops, best of <span style="color: #0000dd; font-weight: bold;">3</span>: <span style="color: #6600ee; font-weight: bold;">16.6</span> µs per loop
<span style="color: #0000dd; font-weight: bold;">10000</span> loops, best of <span style="color: #0000dd; font-weight: bold;">3</span>: <span style="color: #6600ee; font-weight: bold;">70.9</span> µs per loop
<span style="color: #0000dd; font-weight: bold;">100</span> loops, best of <span style="color: #0000dd; font-weight: bold;">3</span>: <span style="color: #6600ee; font-weight: bold;">6.67</span> ms per loop
<span style="color: #0000dd; font-weight: bold;">100000</span> loops, best of <span style="color: #0000dd; font-weight: bold;">3</span>: <span style="color: #6600ee; font-weight: bold;">19.4</span> µs per loop
<span style="color: #0000dd; font-weight: bold;">10000</span> loops, best of <span style="color: #0000dd; font-weight: bold;">3</span>: <span style="color: #6600ee; font-weight: bold;">97.4</span> µs per loop
<span style="color: #0000dd; font-weight: bold;">10</span> loops, best of <span style="color: #0000dd; font-weight: bold;">3</span>: <span style="color: #6600ee; font-weight: bold;">23.7</span> ms per loop
<span style="color: #0000dd; font-weight: bold;">10000</span> loops, best of <span style="color: #0000dd; font-weight: bold;">3</span>: <span style="color: #6600ee; font-weight: bold;">22.1</span> µs per loop
<span style="color: #0000dd; font-weight: bold;">10000</span> loops, best of <span style="color: #0000dd; font-weight: bold;">3</span>: <span style="color: #0000dd; font-weight: bold;">105</span> µs per loop
<span style="color: #0000dd; font-weight: bold;">10</span> loops, best of <span style="color: #0000dd; font-weight: bold;">3</span>: <span style="color: #6600ee; font-weight: bold;">80.2</span> ms per loop
<span style="color: #0000dd; font-weight: bold;">10000</span> loops, best of <span style="color: #0000dd; font-weight: bold;">3</span>: <span style="color: #6600ee; font-weight: bold;">25.6</span> µs per loop
<span style="color: #0000dd; font-weight: bold;">10000</span> loops, best of <span style="color: #0000dd; font-weight: bold;">3</span>: <span style="color: #0000dd; font-weight: bold;">135</span> µs per loop
<span style="color: #0000dd; font-weight: bold;">1</span> loop, best of <span style="color: #0000dd; font-weight: bold;">3</span>: <span style="color: #0000dd; font-weight: bold;">287</span> ms per loop
<span style="color: #0000dd; font-weight: bold;">10000</span> loops, best of <span style="color: #0000dd; font-weight: bold;">3</span>: <span style="color: #6600ee; font-weight: bold;">29.8</span> µs per loop
<span style="color: #0000dd; font-weight: bold;">10000</span> loops, best of <span style="color: #0000dd; font-weight: bold;">3</span>: <span style="color: #0000dd; font-weight: bold;">149</span> µs per loop
<span style="color: #0000dd; font-weight: bold;">1</span> loop, best of <span style="color: #0000dd; font-weight: bold;">3</span>: <span style="color: #6600ee; font-weight: bold;">1.05</span> s per loop
<span style="color: #0000dd; font-weight: bold;">10000</span> loops, best of <span style="color: #0000dd; font-weight: bold;">3</span>: <span style="color: #6600ee; font-weight: bold;">32.7</span> µs per loop
<span style="color: #0000dd; font-weight: bold;">10000</span> loops, best of <span style="color: #0000dd; font-weight: bold;">3</span>: <span style="color: #0000dd; font-weight: bold;">171</span> µs per loop
<span style="color: #0000dd; font-weight: bold;">1</span> loop, best of <span style="color: #0000dd; font-weight: bold;">3</span>: <span style="color: #6600ee; font-weight: bold;">3.78</span> s per loop
</pre>
</td></tr>
</tbody></table>
</div>
<br />
Chart:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-qwU-XkjcYCU/Wsf2m29aVpI/AAAAAAAABe4/EArnP8FnxEMIlizajQfG_bGoGi-yiu6QwCLcBGAs/s1600/Screen%2BShot%2B2018-04-06%2Bat%2B3.36.28%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="614" data-original-width="592" height="320" src="https://4.bp.blogspot.com/-qwU-XkjcYCU/Wsf2m29aVpI/AAAAAAAABe4/EArnP8FnxEMIlizajQfG_bGoGi-yiu6QwCLcBGAs/s320/Screen%2BShot%2B2018-04-06%2Bat%2B3.36.28%2BPM.png" width="308" /></a></div>
<br />
Code for the plot: <!-- HTML generated using hilite.me --><br />
<br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<table><tbody>
<tr><td><pre style="line-height: 125%; margin: 0;"> 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20</pre>
</td><td><pre style="line-height: 125%; margin: 0;"><span style="color: #008800; font-weight: bold;">from</span> <span style="color: #0e84b5; font-weight: bold;">bokeh.palettes</span> <span style="color: #008800; font-weight: bold;">import</span> Spectral11
<span style="color: #008800; font-weight: bold;">from</span> <span style="color: #0e84b5; font-weight: bold;">bokeh.plotting</span> <span style="color: #008800; font-weight: bold;">import</span> figure, show, output_file
p <span style="color: #333333;">=</span> figure(plot_width<span style="color: #333333;">=</span><span style="color: #0000dd; font-weight: bold;">300</span>, plot_height<span style="color: #333333;">=</span><span style="color: #0000dd; font-weight: bold;">300</span>)
slowest <span style="color: #333333;">=</span> [<span style="color: #0000dd; font-weight: bold;">62</span>,<span style="color: #0000dd; font-weight: bold;">200</span>,<span style="color: #0000dd; font-weight: bold;">615</span>,<span style="color: #0000dd; font-weight: bold;">2050</span>,<span style="color: #0000dd; font-weight: bold;">6670</span>,<span style="color: #0000dd; font-weight: bold;">23700</span>,<span style="color: #0000dd; font-weight: bold;">80200</span>,<span style="color: #0000dd; font-weight: bold;">287000</span>,<span style="color: #0000dd; font-weight: bold;">1050000</span>,<span style="color: #0000dd; font-weight: bold;">3780000</span>]
slower <span style="color: #333333;">=</span> [<span style="color: #0000dd; font-weight: bold;">26</span>,<span style="color: #0000dd; font-weight: bold;">32</span>,<span style="color: #0000dd; font-weight: bold;">43</span>,<span style="color: #0000dd; font-weight: bold;">56</span>,<span style="color: #0000dd; font-weight: bold;">70</span>,<span style="color: #0000dd; font-weight: bold;">97</span>,<span style="color: #0000dd; font-weight: bold;">105</span>,<span style="color: #0000dd; font-weight: bold;">135</span>,<span style="color: #0000dd; font-weight: bold;">149</span>,<span style="color: #0000dd; font-weight: bold;">171</span>]
fast <span style="color: #333333;">=</span> [<span style="color: #0000dd; font-weight: bold;">7</span>,<span style="color: #0000dd; font-weight: bold;">9</span>,<span style="color: #0000dd; font-weight: bold;">11</span>,<span style="color: #0000dd; font-weight: bold;">13</span>,<span style="color: #0000dd; font-weight: bold;">16</span>,<span style="color: #0000dd; font-weight: bold;">19</span>,<span style="color: #0000dd; font-weight: bold;">22</span>,<span style="color: #0000dd; font-weight: bold;">25</span>,<span style="color: #0000dd; font-weight: bold;">29</span>,<span style="color: #0000dd; font-weight: bold;">32</span>]
st <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">5</span>
end <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">8</span>
mypalette<span style="color: #333333;">=</span>Spectral11[<span style="color: #0000dd; font-weight: bold;">0</span>:<span style="color: #0000dd; font-weight: bold;">3</span>]
p<span style="color: #333333;">.</span>multi_line(xs<span style="color: #333333;">=</span>[<span style="color: #007020;">list</span>(<span style="color: #007020;">range</span>(st,end)), <span style="color: #007020;">list</span>(<span style="color: #007020;">range</span>(st,end)), <span style="color: #007020;">list</span>(<span style="color: #007020;">range</span>(st,end))],
ys<span style="color: #333333;">=</span>[slowest[:end<span style="color: #333333;">-</span>st],
slower[:end<span style="color: #333333;">-</span>st],
fast[:end<span style="color: #333333;">-</span>st]
],
line_color<span style="color: #333333;">=</span>mypalette,
line_width<span style="color: #333333;">=</span><span style="color: #0000dd; font-weight: bold;">5</span>
)
show(p)
</pre>
</td></tr>
</tbody></table>
</div>
<br />
This shows how the algorithm with exponential time complexity deteriorates for higher values of n:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-FTHNUC7rFpM/Wsf4afimZsI/AAAAAAAABfE/R5Qe-RIYwRABJJgo-gKtP5J7lessWNpyACLcBGAs/s1600/Screen%2BShot%2B2018-04-06%2Bat%2B3.43.22%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="628" data-original-width="612" height="320" src="https://3.bp.blogspot.com/-FTHNUC7rFpM/Wsf4afimZsI/AAAAAAAABfE/R5Qe-RIYwRABJJgo-gKtP5J7lessWNpyACLcBGAs/s320/Screen%2BShot%2B2018-04-06%2Bat%2B3.43.22%2BPM.png" width="311" /></a></div>
Now that I've shown you a bunch of performance numbers and visualization, if you are curious about the algorithm, it is a contrived example of finding the number of paths from one corner of a grid to another, here the squares to the north of the diagonal from top right to bottom left are out of bounds - that is, the path is restricted to the right of the diagonal. In this image, we show the problem for n = 5.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-_7aArCWnIi4/Wsf8btj_opI/AAAAAAAABfo/tX0enUN4ZzYCDpwA0qoa8_ofLFUGLf7xwCLcBGAs/s1600/img07.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1600" data-original-width="1600" height="320" src="https://3.bp.blogspot.com/-_7aArCWnIi4/Wsf8btj_opI/AAAAAAAABfo/tX0enUN4ZzYCDpwA0qoa8_ofLFUGLf7xwCLcBGAs/s320/img07.png" width="319" /></a></div>
<br />
<br />
The exponential algorithm recursively finds the number of paths from each point to the end point (the top right corner). But since you can reach a single point by a number of paths (and this number increases exponentially with n), the same computation of finding the number of paths from this point to the grid corner is repeated, causing the slowdown.<br />
<br />
The next improvement is to remember the number of paths once calculated. Say if we are on [4,2], we will calculate the path to the grid end from here and mark it in M[4][2]. Next time we are at [4,2], we no longer need to calculate again, as the result can be looked up from M[4][2].<br />
<br />
The last algorithm uses dynamic programming to do even less work. It works based on the simple observation that a cell (i,j) can only be reached from just 2 cells. Those are the cell to its immediate left, (i,j-1) and the cell right below it, (i+1,j). Then there is just a single path from these two to (i,j). So if we know the number of paths to those two cells, we can add them up to find the number of paths to (i,j). Then we can keep calculating the paths to each cell, walking from bottom row up, going right on the columns and eventually, we will fill the cell at the top right (0, n -1).<br />
<br />Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-11265228.post-83762473304462048382018-04-04T17:08:00.000-07:002018-04-05T08:57:15.909-07:00Pandas snippets<div class="tr_bq">
Here are some useful snippets that can come in handy when cleaning data with pandas. This was useful for me in completing the coursework for <a href="https://www.coursera.org/learn/python-data-analysis/home/welcome" target="_blank">python data science course</a>.</div>
<br />
<u>Extract a subset of columns from the dataframe based on a regular expression:</u><br />
Code:<br/>
<!-- HTML generated using hilite.me --><div style="background: #ffffff; overflow:auto;width:auto;border:solid gray;border-width:.1em .1em .1em .8em;padding:.2em .6em;"><table><tr><td><pre style="margin: 0; line-height: 125%"> 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24</pre></td><td><pre style="margin: 0; line-height: 125%">persona1 <span style="color: #333333">=</span> pd<span style="color: #333333">.</span>Series({
<span style="background-color: #fff0f0">'Last Post On'</span>: <span style="background-color: #fff0f0">'02/04/2017'</span>,
<span style="background-color: #fff0f0">'Friends-2015'</span>: <span style="color: #0000DD; font-weight: bold">10</span>,
<span style="background-color: #fff0f0">'Friends-2016'</span>: <span style="color: #0000DD; font-weight: bold">20</span>,
<span style="background-color: #fff0f0">'Friends-2017'</span>: <span style="color: #0000DD; font-weight: bold">300</span>
})
persona2 <span style="color: #333333">=</span> pd<span style="color: #333333">.</span>Series({
<span style="background-color: #fff0f0">'Last Post On'</span>: <span style="background-color: #fff0f0">'02/04/2018'</span>,
<span style="background-color: #fff0f0">'Friends-2015'</span>: <span style="color: #0000DD; font-weight: bold">100</span>,
<span style="background-color: #fff0f0">'Friends-2016'</span>: <span style="color: #0000DD; font-weight: bold">240</span>,
<span style="background-color: #fff0f0">'Friends-2017'</span>: <span style="color: #0000DD; font-weight: bold">560</span>
})
persona3 <span style="color: #333333">=</span> pd<span style="color: #333333">.</span>Series({
<span style="background-color: #fff0f0">'Last Post On'</span>: <span style="background-color: #fff0f0">'02/04/2014'</span>,
<span style="background-color: #fff0f0">'Friends-2015'</span>: <span style="color: #0000DD; font-weight: bold">120</span>,
<span style="background-color: #fff0f0">'Friends-2016'</span>: <span style="color: #0000DD; font-weight: bold">120</span>,
<span style="background-color: #fff0f0">'Friends-2017'</span>: <span style="color: #0000DD; font-weight: bold">120</span>
})
df <span style="color: #333333">=</span> pd<span style="color: #333333">.</span>DataFrame([persona1, persona2, persona3],
index<span style="color: #333333">=</span>[<span style="background-color: #fff0f0">'Chris'</span>, <span style="background-color: #fff0f0">'Bella'</span>, <span style="background-color: #fff0f0">'Laura'</span>])
df<span style="color: #333333">.</span>filter(regex<span style="color: #333333">=</span>(<span style="background-color: #fff0f0">"Friends-\d{4}"</span>))
</pre></td></tr></table></div>
<br />
Output:<br />
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Friends-2015</th>
<th>Friends-2016</th>
<th>Friends-2017</th>
</tr>
</thead>
<tbody>
<tr>
<th>Chris</th>
<td>10</td>
<td>20</td>
<td>300</td>
</tr>
<tr>
<th>Bella</th>
<td>100</td>
<td>240</td>
<td>560</td>
</tr>
<tr>
<th>Laura</th>
<td>120</td>
<td>120</td>
<td>120</td>
</tr>
</tbody>
</table>
<blockquote>
<br /></blockquote>
<u>Set a column based on the value of both the current row and adjacent rows:</u><br />
<br />
For this example, we define regulars to the gym as those who have gone to the gym last year at least 3 months in a row:<br />
Code:<br/>
<!-- HTML generated using hilite.me --><div style="background: #ffffff; overflow:auto;width:auto;border:solid gray;border-width:.1em .1em .1em .8em;padding:.2em .6em;"><table><tr><td><pre style="margin: 0; line-height: 125%"> 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24</pre></td><td><pre style="margin: 0; line-height: 125%"><span style="color: #008800; font-weight: bold">import</span> <span style="color: #0e84b5; font-weight: bold">datetime</span>
df <span style="color: #333333">=</span> pd<span style="color: #333333">.</span>DataFrame({<span style="background-color: #fff0f0">'Month'</span>:
[datetime<span style="color: #333333">.</span>date(<span style="color: #0000DD; font-weight: bold">2008</span>, i, <span style="color: #0000DD; font-weight: bold">1</span>)<span style="color: #333333">.</span>strftime(<span style="background-color: #fff0f0">'%B'</span>)
<span style="color: #008800; font-weight: bold">for</span> i <span style="color: #000000; font-weight: bold">in</span> <span style="color: #007020">range</span>(<span style="color: #0000DD; font-weight: bold">1</span>,<span style="color: #0000DD; font-weight: bold">13</span>)] <span style="color: #333333">*</span> <span style="color: #0000DD; font-weight: bold">3</span>,
<span style="background-color: #fff0f0">'visited'</span>: [<span style="color: #008800; font-weight: bold">False</span>]<span style="color: #333333">*</span><span style="color: #0000DD; font-weight: bold">36</span>},
index<span style="color: #333333">=</span>[<span style="background-color: #fff0f0">'Alice'</span>]<span style="color: #333333">*</span><span style="color: #0000DD; font-weight: bold">12</span> <span style="color: #333333">+</span>
[<span style="background-color: #fff0f0">'Bob'</span>]<span style="color: #333333">*</span><span style="color: #0000DD; font-weight: bold">12</span> <span style="color: #333333">+</span>
[<span style="background-color: #fff0f0">'Bridgett'</span>]<span style="color: #333333">*</span><span style="color: #0000DD; font-weight: bold">12</span>)
df <span style="color: #333333">=</span> df<span style="color: #333333">.</span>reset_index()
<span style="color: #008800; font-weight: bold">def</span> <span style="color: #0066BB; font-weight: bold">make_regular</span>(r, name):
r[<span style="background-color: #fff0f0">'visited'</span>] <span style="color: #333333">=</span> (r[<span style="background-color: #fff0f0">'visited'</span>] <span style="color: #000000; font-weight: bold">or</span> (r[<span style="background-color: #fff0f0">'index'</span>] <span style="color: #333333">==</span> name) <span style="color: #000000; font-weight: bold">and</span>
((r[<span style="background-color: #fff0f0">'Month'</span>] <span style="color: #333333">==</span> <span style="background-color: #fff0f0">'February'</span>) <span style="color: #000000; font-weight: bold">or</span>
(r[<span style="background-color: #fff0f0">'Month'</span>] <span style="color: #333333">==</span> <span style="background-color: #fff0f0">'March'</span>) <span style="color: #000000; font-weight: bold">or</span>
(r[<span style="background-color: #fff0f0">'Month'</span>] <span style="color: #333333">==</span> <span style="background-color: #fff0f0">'April'</span>)))
<span style="color: #008800; font-weight: bold">return</span> r
df <span style="color: #333333">=</span> df<span style="color: #333333">.</span>apply(make_regular, axis<span style="color: #333333">=</span><span style="color: #0000DD; font-weight: bold">1</span>, args<span style="color: #333333">=</span>(<span style="background-color: #fff0f0">'Alice'</span>,))
df <span style="color: #333333">=</span> df<span style="color: #333333">.</span>apply(make_regular, axis<span style="color: #333333">=</span><span style="color: #0000DD; font-weight: bold">1</span>, args<span style="color: #333333">=</span>(<span style="background-color: #fff0f0">'Bob'</span>,))
regular <span style="color: #333333">=</span> ((df[<span style="background-color: #fff0f0">'visited'</span>] <span style="color: #333333">==</span> <span style="color: #008800; font-weight: bold">True</span>) <span style="color: #333333">&</span>
(df[<span style="background-color: #fff0f0">'visited'</span>]<span style="color: #333333">.</span>shift(<span style="color: #333333">-</span><span style="color: #0000DD; font-weight: bold">1</span>) <span style="color: #333333">==</span> <span style="color: #008800; font-weight: bold">True</span>) <span style="color: #333333">&</span>
(df[<span style="background-color: #fff0f0">'visited'</span>]<span style="color: #333333">.</span>shift(<span style="color: #333333">-</span><span style="color: #0000DD; font-weight: bold">2</span>) <span style="color: #333333">==</span> <span style="color: #008800; font-weight: bold">True</span>))
df[regular][<span style="background-color: #fff0f0">'index'</span>]<span style="color: #333333">.</span>values <span style="color: #333333">.</span>tolist()
</pre></td></tr></table></div>
<br />
Output:<br/>
<!-- HTML generated using hilite.me --><div style="background: #ffffff; overflow:auto;width:auto;border:solid gray;border-width:.1em .1em .1em .8em;padding:.2em .6em;"><table><tr><td><pre style="margin: 0; line-height: 125%">1</pre></td><td><pre style="margin: 0; line-height: 125%">[<span style="background-color: #fff0f0">'Alice'</span>, <span style="background-color: #fff0f0">'Bob'</span>]
</pre></td></tr></table></div>
<br />
<blockquote>
<br /></blockquote>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-11265228.post-91052215489620626012018-03-23T19:32:00.001-07:002018-03-23T19:33:29.677-07:00Pushing your code to pypi<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-e0cFWQSd4B4/WYjo2egPjjI/AAAAAAAABNI/h5fXdwA5EUgmjrGGTxleIj1Opf8WrfAzgCPcBGAYYCw/s1600/pylogo.jpeg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="116" data-original-width="116" src="https://3.bp.blogspot.com/-e0cFWQSd4B4/WYjo2egPjjI/AAAAAAAABNI/h5fXdwA5EUgmjrGGTxleIj1Opf8WrfAzgCPcBGAYYCw/s1600/pylogo.jpeg" /></a></div>
<br />
<br />
<a href="http://peterdowns.com/posts/first-time-with-pypi.html" target="_blank">Here is a good document</a> that describes how to push your code to the Pypi repository.<br />
<div>
<br /></div>
<div>
A URL has changed slightly. In your ~/.pypirc set the URL as follows:</div>
<div>
<br /></div>
<div>
<style type="text/css">
p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Menlo; color: #000000; background-color: #ffffff}
span.s1 {font-variant-ligatures: no-common-ligatures}
</style>
<br />
<div class="p1">
<span class="s1">[pypitest]</span></div>
<div class="p1">
<span class="s1">repository=https://test.pypi.org/legacy/</span></div>
<div class="p1">
<span class="s1"><span style="font-family: "times"; font-size: small; font-variant-ligatures: normal;"><br /></span></span></div>
<div class="p1">
<span class="s1"><span style="font-family: "times"; font-size: small; font-variant-ligatures: normal;">The register step is no longer required. All you need to do is upload the files.</span></span></div>
<div class="p1">
<style type="text/css">
p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Menlo; color: #000000; background-color: #ffffff}
span.s1 {font-variant-ligatures: no-common-ligatures}
</style>
</div>
<div class="p1">
<span class="s1"><br /></span></div>
<div class="p1">
<span class="s1">python setup.py sdist upload -r pypitest</span></div>
<div class="p1">
<br /></div>
<div class="p1">
<span style="font-family: "times"; font-size: small;">Each time you initiate an upload, you'd need to change the version number and the URL.</span><br />
<span style="font-family: "times"; font-size: small;"><br /></span>
<span style="font-family: "times"; font-size: small;">While this uploaded the package to test.pypi.org, the upload steps had changed for pypi.org:</span><br />
<span style="font-family: "times"; font-size: small;"><br /></span>
<span style="font-family: "times"; font-size: small;">
<style type="text/css">
p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Menlo; color: #000000; background-color: #ffffff}
p.p2 {margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Menlo; color: #000000; background-color: #ffffff; min-height: 13.0px}
span.s1 {font-variant-ligatures: no-common-ligatures}
span.s2 {font-variant-ligatures: no-common-ligatures; color: #33bbc8}
</style>
</span><br />
<div class="p1">
<span style="font-family: "times"; font-size: small;"><span class="s1">thushara@ figleaf </span><span class="s2">(master)</span><span class="s1">$ python setup.py sdist upload -r pypi</span></span></div>
<span style="font-family: "times"; font-size: small;">
</span>
<div class="p1">
<span style="font-family: "times"; font-size: small;"><span class="s1">/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'install_requires'</span></span></div>
<span style="font-family: "times"; font-size: small;">
<div class="p1">
<span class="s1"><span class="Apple-converted-space"> </span>warnings.warn(msg)</span></div>
<div class="p1">
<span class="s1">running sdist</span></div>
<div class="p1">
<span class="s1">running check</span></div>
<div class="p1">
<span class="s1">warning: sdist: manifest template 'MANIFEST.in' does not exist (using default file list)</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1">warning: sdist: standard file not found: should have one of README, README.txt</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1">writing manifest file 'MANIFEST'</span></div>
<div class="p1">
<span class="s1">creating figleaf-0.2</span></div>
<div class="p1">
<span class="s1">creating figleaf-0.2/figleaf</span></div>
<div class="p1">
<span class="s1">making hard links in figleaf-0.2...</span></div>
<div class="p1">
<span class="s1">hard linking setup.cfg -> figleaf-0.2</span></div>
<div class="p1">
<span class="s1">hard linking setup.py -> figleaf-0.2</span></div>
<div class="p1">
<span class="s1">hard linking figleaf/__init__.py -> figleaf-0.2/figleaf</span></div>
<div class="p1">
<span class="s1">hard linking figleaf/graph.py -> figleaf-0.2/figleaf</span></div>
<div class="p1">
<span class="s1">Creating tar archive</span></div>
<div class="p1">
<span class="s1">removing 'figleaf-0.2' (and everything under it)</span></div>
<div class="p1">
<span class="s1">running upload</span></div>
<div class="p1">
<span class="s1">Submitting dist/figleaf-0.2.tar.gz to https://pypi.python.org/pypi</span></div>
<div class="p1">
<span class="s1">Upload failed (410): Gone (This API has been deprecated and removed from legacy PyPI in favor of using the APIs available in the new PyPI.org implementation of PyPI (located at https://pypi.org/). For more information about migrating your use of this API to PyPI.org, please see https://packaging.python.org/guides/migrating-to-pypi-org/#uploading. For more information about the sunsetting of this API, please see https://mail.python.org/pipermail/distutils-sig/2017-June/030766.html)</span></div>
<div class="p1">
<span class="s1">error: Upload failed (410): Gone (This API has been deprecated and removed from legacy PyPI in favor of using the APIs available in the new PyPI.org implementation of PyPI (located at https://pypi.org/). For more information about migrating your use of this API to PyPI.org, please see https://packaging.python.org/guides/migrating-to-pypi-org/#uploading. For more information about the sunsetting of this API, please see https://mail.python.org/pipermail/distutils-sig/2017-June/030766.html)</span></div>
<div class="p1">
<br /></div>
<div>
<span style="font-family: "times"; font-size: small;">To upload to pypi I used <a href="https://pypi.python.org/pypi/twine" target="_blank">twine</a>. Installing that on MacOS High Sierra required the removal of SIP.</span></div>
<div>
<span style="font-family: "times"; font-size: small;"><br /></span></div>
<div>
In ~/.pypirc, I removed the repository line under [pypi]</div>
<div>
<br /></div>
</span><br />
<blockquote class="tr_bq">
python setup.py sdist</blockquote>
</div>
<div class="p1">
<br /></div>
<div class="p1">
Remove old tars under dist, and</div>
<div class="p1">
<br /></div>
<blockquote class="tr_bq">
twine upload dist/*</blockquote>
</div>
<div>
<br /></div>
<div>
Now I could see the <a href="https://pypi.python.org/pypi/wildhops/0.2" target="_blank">project under pypi</a></div>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-11265228.post-43747064221871448342018-03-23T18:47:00.002-07:002018-03-23T18:48:19.568-07:00Installing Twine on MacOS High Sierra<style type="text/css">
p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Menlo; color: #000000; background-color: #ffffff}
p.p2 {margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Menlo; color: #c33720; background-color: #ffffff}
span.s1 {font-variant-ligatures: no-common-ligatures}
span.s2 {font-variant-ligatures: no-common-ligatures; color: #33bbc8}
span.s3 {font-variant-ligatures: no-common-ligatures; color: #c33720}
</style>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-e0cFWQSd4B4/WYjo2egPjjI/AAAAAAAABNI/h5fXdwA5EUgmjrGGTxleIj1Opf8WrfAzgCPcBGAYYCw/s1600/pylogo.jpeg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="116" data-original-width="116" src="https://3.bp.blogspot.com/-e0cFWQSd4B4/WYjo2egPjjI/AAAAAAAABNI/h5fXdwA5EUgmjrGGTxleIj1Opf8WrfAzgCPcBGAYYCw/s1600/pylogo.jpeg" /></a></div>
<div class="p1">
<span class="s1">thushara@ wildhops </span><span class="s2">(master)</span><span class="s3">*</span><span class="s1">$ sudo -H pip install twine</span></div>
<div class="p1">
<span class="s1">Password:</span></div>
<div class="p1">
<span class="s1">Collecting twine</span></div>
<div class="p1">
<span class="s1"><span class="Apple-converted-space"> </span>Downloading twine-1.11.0-py2.py3-none-any.whl</span></div>
<div class="p1">
<span class="s1">Collecting pkginfo>=1.4.2 (from twine)</span></div>
<div class="p1">
<span class="s1"><span class="Apple-converted-space"> </span>Downloading pkginfo-1.4.2-py2.py3-none-any.whl</span></div>
<div class="p1">
<span class="s1">Requirement already satisfied: setuptools>=0.7.0 in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from twine)</span></div>
<div class="p1">
<span class="s1">Collecting tqdm>=4.14 (from twine)</span></div>
<div class="p1">
<span class="s1"><span class="Apple-converted-space"> </span>Downloading tqdm-4.19.8-py2.py3-none-any.whl (52kB)</span></div>
<div class="p1">
<span class="s1"><span class="Apple-converted-space"> </span>100% |████████████████████████████████| 61kB 2.1MB/s<span class="Apple-converted-space"> </span></span></div>
<div class="p1">
<span class="s1">Collecting requests-toolbelt>=0.8.0 (from twine)</span></div>
<div class="p1">
<span class="s1"><span class="Apple-converted-space"> </span>Downloading requests_toolbelt-0.8.0-py2.py3-none-any.whl (54kB)</span></div>
<div class="p1">
<span class="s1"><span class="Apple-converted-space"> </span>100% |████████████████████████████████| 61kB 1.6MB/s<span class="Apple-converted-space"> </span></span></div>
<div class="p1">
<span class="s1">Requirement already satisfied: requests!=2.15,!=2.16,>=2.5.0 in /Library/Python/2.7/site-packages (from twine)</span></div>
<div class="p1">
<span class="s1">Installing collected packages: pkginfo, tqdm, requests-toolbelt, twine</span></div>
<div class="p2">
<span class="s1">Exception:</span></div>
<div class="p2">
<span class="s1">Traceback (most recent call last):</span></div>
<div class="p2">
<span class="s1"><span class="Apple-converted-space"> </span>File "/Library/Python/2.7/site-packages/pip/basecommand.py", line 215, in main</span></div>
<div class="p2">
<span class="s1"><span class="Apple-converted-space"> </span>status = self.run(options, args)</span></div>
<div class="p2">
<span class="s1"><span class="Apple-converted-space"> </span>File "/Library/Python/2.7/site-packages/pip/commands/install.py", line 342, in run</span></div>
<div class="p2">
<span class="s1"><span class="Apple-converted-space"> </span>prefix=options.prefix_path,</span></div>
<div class="p2">
<span class="s1"><span class="Apple-converted-space"> </span>File "/Library/Python/2.7/site-packages/pip/req/req_set.py", line 784, in install</span></div>
<div class="p2">
<span class="s1"><span class="Apple-converted-space"> </span>**kwargs</span></div>
<div class="p2">
<span class="s1"><span class="Apple-converted-space"> </span>File "/Library/Python/2.7/site-packages/pip/req/req_install.py", line 851, in install</span></div>
<div class="p2">
<span class="s1"><span class="Apple-converted-space"> </span>self.move_wheel_files(self.source_dir, root=root, prefix=prefix)</span></div>
<div class="p2">
<span class="s1"><span class="Apple-converted-space"> </span>File "/Library/Python/2.7/site-packages/pip/req/req_install.py", line 1064, in move_wheel_files</span></div>
<div class="p2">
<span class="s1"><span class="Apple-converted-space"> </span>isolated=self.isolated,</span></div>
<div class="p2">
<span class="s1"><span class="Apple-converted-space"> </span>File "/Library/Python/2.7/site-packages/pip/wheel.py", line 377, in move_wheel_files</span></div>
<div class="p2">
<span class="s1"><span class="Apple-converted-space"> </span>clobber(source, dest, False, fixer=fixer, filter=filter)</span></div>
<div class="p2">
<span class="s1"><span class="Apple-converted-space"> </span>File "/Library/Python/2.7/site-packages/pip/wheel.py", line 316, in clobber</span></div>
<div class="p2">
<span class="s1"><span class="Apple-converted-space"> </span>ensure_dir(destdir)</span></div>
<div class="p2">
<span class="s1"><span class="Apple-converted-space"> </span>File "/Library/Python/2.7/site-packages/pip/utils/__init__.py", line 83, in ensure_dir</span></div>
<div class="p2">
<span class="s1"><span class="Apple-converted-space"> </span>os.makedirs(path)</span></div>
<div class="p2">
<span class="s1"><span class="Apple-converted-space"> </span>File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/os.py", line 150, in makedirs</span></div>
<div class="p2">
<span class="s1"><span class="Apple-converted-space"> </span>makedirs(head, mode)</span></div>
<div class="p2">
<span class="s1"><span class="Apple-converted-space"> </span>File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/os.py", line 157, in makedirs</span></div>
<div class="p2">
<span class="s1"><span class="Apple-converted-space"> </span>mkdir(name, mode)</span></div>
<div class="p2">
<span class="s1">OSError: [Errno 1] Operation not permitted: '/System/Library/Frameworks/Python.framework/Versions/2.7/man'</span></div>
<br />
The only way to get write access under /System is to boot into Recovery Mode and run this command on the Terminal:<br />
<br />
<span style="background-color: #eff0f1; color: #242729; font-family: "consolas" , "menlo" , "monaco" , "lucida console" , "liberation mono" , "dejavu sans mono" , "bitstream vera sans mono" , "courier new" , monospace , sans-serif; font-size: 13px; white-space: pre-wrap;">csrutil disable</span><br />
<span style="background-color: #eff0f1; color: #242729; font-family: "consolas" , "menlo" , "monaco" , "lucida console" , "liberation mono" , "dejavu sans mono" , "bitstream vera sans mono" , "courier new" , monospace , sans-serif; font-size: 13px; white-space: pre-wrap;"><br /></span>
<span style="background-color: #eff0f1; color: #242729; font-family: "consolas" , "menlo" , "monaco" , "lucida console" , "liberation mono" , "dejavu sans mono" , "bitstream vera sans mono" , "courier new" , monospace , sans-serif; font-size: 13px; white-space: pre-wrap;"><br /></span>
Reboot, install again<br />
<br />Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-11265228.post-75057901324902047932018-03-22T16:03:00.000-07:002018-03-22T16:03:13.517-07:00A Graph in Python - and pythonic surprises<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-Vc2pgIaWl84/WrQ1hJbFE7I/AAAAAAAABaU/nohDwuV3vAkVVjkT2xYi_bgModiIOt_DACLcBGAs/s1600/graph.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="300" data-original-width="300" height="200" src="https://1.bp.blogspot.com/-Vc2pgIaWl84/WrQ1hJbFE7I/AAAAAAAABaU/nohDwuV3vAkVVjkT2xYi_bgModiIOt_DACLcBGAs/s200/graph.png" width="200" /></a></div>
I started implementing a Graph in python for a project and I encountered an unexpected behavior. See if you can spot the problem.<br />
<br />
Code for the graph is <a href="https://gist.github.com/thushw/e495a494512e280d583c4d2333080444" target="_blank">here</a>:<br />
<br />
<br />
<br />
However this is buggy. Each time an edge is added to one node, it gets added to all the nodes. Adding an edge from 'bellevue' to 'lynwood' added the edge to both vertices 'bellevue' and 'lynwood'.<br />
<br />
Code/Output:<br />
<br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">g.add_node<span style="color: #333333;">(</span>GraphNode<span style="color: #333333;">(</span><span style="background-color: #fff0f0;">'seattle'</span>, <span style="color: #333333;">[</span>Edge<span style="color: #333333;">(</span><span style="background-color: #fff0f0;">'seattle'</span>, <span style="background-color: #fff0f0;">'bellevue'</span>, <span style="background-color: #fff0f0;">'dist'</span>, 10<span style="color: #333333;">)</span>, Edge<span style="color: #333333;">(</span><span style="background-color: #fff0f0;">'seattle'</span>, <span style="background-color: #fff0f0;">'lynwood'</span>, <span style="background-color: #fff0f0;">'dist'</span>, 20<span style="color: #333333;">)]))</span>
g.add_edge<span style="color: #333333;">((</span><span style="background-color: #fff0f0;">'bellevue'</span>, <span style="background-color: #fff0f0;">'lynwood'</span>, <span style="background-color: #fff0f0;">'dist'</span>, 5<span style="color: #333333;">))</span>
print <span style="color: #333333;">(</span>g<span style="color: #333333;">)</span>
bellevue -> bellevue:lynwood:dist:5
lynwood -> bellevue:lynwood:dist:5
seattle -> seattle:bellevue:dist:10 seattle:lynwood:dist:20
</pre>
</div>
<br />
After a lengthy debugging stint, the issue was identified to be the way <a href="http://docs.python-guide.org/en/latest/writing/gotchas/" target="_blank">Python evaluates default argument values to functions</a>.
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-11265228.post-1429516224347818662017-12-07T20:36:00.001-08:002018-03-22T15:34:54.257-07:00Python : Common pitfalls<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><b>Join a list if and only if all values in the list are strings:</b>
Code:
print <span style="color: #333333;">(</span><span style="background-color: #fff0f0;">"running %s"</span> % <span style="background-color: #fff0f0;">' '</span>.join<span style="color: #333333;">(</span>cmd<span style="color: #333333;">))</span>
Error:
Traceback <span style="color: #333333;">(</span>most recent call last<span style="color: #333333;">)</span>:
File <span style="background-color: #fff0f0;">"test.py"</span>, line 29, in <module>
print <span style="color: #333333;">(</span><span style="background-color: #fff0f0;">"running %s"</span> % <span style="background-color: #fff0f0;">' '</span>.join<span style="color: #333333;">(</span>cmd<span style="color: #333333;">))</span>
TypeError: sequence item 4: expected string, int found
Cause:
There are non-strings in the list cmd.
Ex:
<span style="color: #996633;">cmd</span> <span style="color: #333333;">=</span> <span style="color: #333333;">[</span><span style="background-color: #fff0f0;">"runthis.py"</span>, <span style="background-color: #fff0f0;">"--host"</span>, host, <span style="background-color: #fff0f0;">"--port"</span>, port<span style="color: #333333;">]</span>
To make join happy:
<span style="color: #996633;">cmd</span> <span style="color: #333333;">=</span> <span style="color: #333333;">[</span><span style="background-color: #fff0f0;">"runthis.py"</span>, <span style="background-color: #fff0f0;">"--host"</span>, host, <span style="background-color: #fff0f0;">"--port"</span>, str<span style="color: #333333;">(</span>port<span style="color: #333333;">)]</span>
<b>Avoid default values in mutable arguments:</b>
Code:
class A:
def __init__<span style="color: #333333;">(</span>self, <span style="color: #996633;">lst</span><span style="color: #333333;">=[])</span>:
self.lst <span style="color: #333333;">=</span> lst
<span style="color: #996633;">a</span> <span style="color: #333333;">=</span> A<span style="color: #333333;">()</span>
<span style="color: #996633;">b</span> <span style="color: #333333;">=</span> A<span style="color: #333333;">()</span>
a.lst.append<span style="color: #333333;">(</span><span style="background-color: #fff0f0;">'crocs'</span><span style="color: #333333;">)</span>
print <span style="color: #333333;">(</span>b.lst<span style="color: #333333;">)</span>
Output:
<span style="color: #333333;">[</span><span style="background-color: #fff0f0;">'crocs'</span><span style="color: #333333;">]</span>
Cause:
Python evaluates default arguments to a function at the time the function</pre>
<pre style="line-height: 125%; margin: 0;">is defined, not each time it is called. So all instances will mutate a single </pre>
<pre style="line-height: 125%; margin: 0;">list.</pre>
</div>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-11265228.post-49351136599908774182017-11-05T12:19:00.000-08:002017-11-08T16:53:02.881-08:00A Beautiful Soupy Exercise in Scraping Interesting IntegersIntegers can be very interesting at least if you are a mathematician, and even for a lay person like me, interesting integers can be used to spice up some data that is of interest. My interest here being the bike counter installed on the Fremont Bridge sidewalk that counts the number of bicycles crossing the bridge in both directions.<br />
<div>
<br /></div>
<div>
Inspired by this idea of <a href="http://rooreynolds.com/2008/04/24/blogjects-and-tweetjects/">blogjects</a> and <a href="http://stanford-clark.com/andy_house.html">twittering houses</a>, I wanted to send out an early morning tweet of the number of cyclists who braved the streets across Fremont the day before. The data is uploaded by SDOT early morning, and a cron task would request for this and tweet it.</div>
<div>
<br /></div>
<div>
Simple and a tad boring. Now what if I could map the count to something interesting? Researching on interesting integers, I came up on the theorem that there are no <a href="https://en.wikipedia.org/wiki/Interesting_number_paradox">uninteresting integers</a> because after all if there were a bunch of these, and one of them must be the smallest of the lot, and the fact itself makes this number interesting.</div>
<div>
<br /></div>
<div>
Energized in no small measure by this revelation, I sallied forth to find a list of interesting integers in the thousands range, as every day there were ~ 3000 cyclists being logged. It didn't take long for me to reach a comprehensive page of <a href="http://www2.stetson.edu/~efriedma/numbers.html">integers</a>. The pattern is an integer within a <font> tag, followed by a phrase that describes it.</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-6JiTxS7mwhk/Wfy1uR_D9pI/AAAAAAAABOk/04ueeKroksoGLrIhiVCScwv_vUt1vYXIQCLcBGAs/s1600/numbers.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="238" data-original-width="504" height="151" src="https://2.bp.blogspot.com/-6JiTxS7mwhk/Wfy1uR_D9pI/AAAAAAAABOk/04ueeKroksoGLrIhiVCScwv_vUt1vYXIQCLcBGAs/s320/numbers.png" width="320" /></a></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<span style="font-family: monospace; white-space: pre-wrap;">
</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><font <span class="html-attribute-name">size</span>=<span class="html-attribute-value">+3</span> <span class="html-attribute-name">color</span>=<span class="html-attribute-value">gray</span>></span><span style="font-family: monospace; white-space: pre-wrap;">0</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"></font></span><span style="font-family: monospace; white-space: pre-wrap;"> is the </span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><a <span class="html-attribute-name">href</span>="<a class="html-attribute-value html-external-link" href="http://mathworld.wolfram.com/AdditiveIdentity.html" target="_blank">http://mathworld.wolfram.com/AdditiveIdentity.html</a>"></span><span style="font-family: monospace; white-space: pre-wrap;">additive identity</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"></a></span><span style="font-family: monospace; white-space: pre-wrap;">.</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><br></span><span style="font-family: monospace; white-space: pre-wrap;">
</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><font <span class="html-attribute-name">size</span>=<span class="html-attribute-value">+3</span> <span class="html-attribute-name">color</span>=<span class="html-attribute-value">gray</span>></span><span style="font-family: monospace; white-space: pre-wrap;">1</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"></font></span><span style="font-family: monospace; white-space: pre-wrap;"> is the </span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><a <span class="html-attribute-name">href</span>="<a class="html-attribute-value html-external-link" href="http://mathworld.wolfram.com/MultiplicativeIdentity.html" target="_blank">http://mathworld.wolfram.com/MultiplicativeIdentity.html</a>"></span><span style="font-family: monospace; white-space: pre-wrap;">multiplicative identity</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"></a></span><span style="font-family: monospace; white-space: pre-wrap;">.</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><br></span><span style="font-family: monospace; white-space: pre-wrap;">
</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><font <span class="html-attribute-name">size</span>=<span class="html-attribute-value">+3</span> <span class="html-attribute-name">color</span>=<span class="html-attribute-value">darkblue</span>></span><span style="font-family: monospace; white-space: pre-wrap;">2</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"></font></span><span style="font-family: monospace; white-space: pre-wrap;"> is the only even </span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><a <span class="html-attribute-name">href</span>="<a class="html-attribute-value html-external-link" href="http://mathworld.wolfram.com/PrimeNumber.html" target="_blank">http://mathworld.wolfram.com/PrimeNumber.html</a>"></span><span style="font-family: monospace; white-space: pre-wrap;">prime</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"></a></span><span style="font-family: monospace; white-space: pre-wrap;">.</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><br></span></div>
<div>
<br /></div>
<div>
At first glance, it seemed a simple matter of using <a href="https://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a> to get each font tag, extract its text, then look for the font's sibling to extract the phrase.</div>
<div>
<br /></div>
<div>
However, the font tag has multiple siblings that make up the complete phrase. In the soup these are represented as NavigableString objects. It's a matter of moving across the document until we hit a <br> tag, collecting all the text as we go along.</div>
<div>
<br /></div>
<div>
Now since all of this needs to be part of the tweet, I quickly realized that not much can be said in 140 characters. So I didn't bother keeping the URLs. I used a jupyter notebook to quickly prototype the outline, and I can't stress enough how useful this is, specially when you are dealing with an unfamiliar API (which Beautiful soup was to me).</div>
<div>
<br /></div>
<div>
Here is how I used the notebook to understand the basic structure of the page:</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-YQGlAu68Ulg/Wfy5NcOohOI/AAAAAAAABO8/ZFRIwDhVTZYJxagkqab3eO_cLIpkyHOFgCLcBGAs/s1600/fiddlesoup.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="396" src="https://1.bp.blogspot.com/-YQGlAu68Ulg/Wfy5NcOohOI/AAAAAAAABO8/ZFRIwDhVTZYJxagkqab3eO_cLIpkyHOFgCLcBGAs/s640/fiddlesoup.png" width="640" /></a></div>
<div>
<br /></div>
<div>
So getting to the integers was quite trivial as BeautifuSoup provides a way to search for a specific tag (font) with a specific value for a given attribute (size=+3). Since the phrase for the integer is in a number of contiguous elements, we need to construct it by visiting siblings of the font tag until we hit a <br> tag.</div>
<div>
<br /></div>
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">unexpected_tags <span style="color: #333333;">=</span> {}
<span style="color: #008800; font-weight: bold;">def</span> <span style="color: #0066bb; font-weight: bold;">get_text_to_eol</span>(font_section):
text_parts <span style="color: #333333;">=</span> []
section <span style="color: #333333;">=</span> font_section<span style="color: #333333;">.</span>next_sibling
<span style="color: #008800; font-weight: bold;">while</span> section<span style="color: #333333;">.</span>name <span style="color: #333333;">!=</span> <span style="background-color: #fff0f0;">'br'</span>:
<span style="color: #008800; font-weight: bold;">if</span> section<span style="color: #333333;">.</span>name <span style="color: #333333;">==</span> <span style="background-color: #fff0f0;">'a'</span>:
text_parts<span style="color: #333333;">.</span>append(section<span style="color: #333333;">.</span>string)
<span style="color: #008800; font-weight: bold;">elif</span> section<span style="color: #333333;">.</span>name <span style="color: black; font-weight: bold;">is</span> <span style="color: #007020;">None</span>:
text_parts<span style="color: #333333;">.</span>append(<span style="color: #007020;">str</span>(section))
<span style="color: #008800; font-weight: bold;">else</span>:
<span style="color: #008800; font-weight: bold;">print</span> (<span style="background-color: #fff0f0;">"found </span><span style="background-color: #eeeeee;">%s</span><span style="background-color: #fff0f0;"> tag"</span> <span style="color: #333333;">%</span> section<span style="color: #333333;">.</span>name)
unexpected_tags[section<span style="color: #333333;">.</span>name] <span style="color: #333333;">=</span> unexpected_tags<span style="color: #333333;">.</span>get(section<span style="color: #333333;">.</span>name, <span style="color: #0000dd; font-weight: bold;">0</span>)<span style="color: #333333;">+</span><span style="color: #0000dd; font-weight: bold;">1</span>
section <span style="color: #333333;">=</span> section<span style="color: #333333;">.</span>next_sibling
<span style="color: #008800; font-weight: bold;">return</span> <span style="background-color: #fff0f0;">' '</span><span style="color: #333333;">.</span>join(text_parts)
</pre>
</div>
<br />
Now I ran through the results in jupyter, and the first few are shown below:<br />
<br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #008800; font-weight: bold;">for</span> number, text <span style="color: black; font-weight: bold;">in</span> <span style="color: #007020;">map</span>(<span style="color: #008800; font-weight: bold;">lambda</span> section: (section<span style="color: #333333;">.</span>get_text(), get_text_to_eol(section)), integer_sections):
<span style="color: #008800; font-weight: bold;">print</span> (number, text)
</pre>
</div>
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #0000dd; font-weight: bold;">0</span> <span style="color: black; font-weight: bold;">is</span> the additive identity <span style="color: #333333;">.</span>
<span style="color: #0000dd; font-weight: bold;">1</span> <span style="color: black; font-weight: bold;">is</span> the multiplicative identity <span style="color: #333333;">.</span>
<span style="color: #0000dd; font-weight: bold;">2</span> <span style="color: black; font-weight: bold;">is</span> the only even prime <span style="color: #333333;">.</span>
<span style="color: #0000dd; font-weight: bold;">3</span> <span style="color: black; font-weight: bold;">is</span> the number of spatial dimensions we live <span style="color: black; font-weight: bold;">in</span><span style="color: #333333;">.</span>
<span style="color: #0000dd; font-weight: bold;">4</span> <span style="color: black; font-weight: bold;">is</span> the smallest number of colors sufficient to color <span style="color: #007020;">all</span> planar maps<span style="color: #333333;">.</span>
<span style="color: #0000dd; font-weight: bold;">5</span> <span style="color: black; font-weight: bold;">is</span> the number of Platonic solids <span style="color: #333333;">.</span>
<span style="color: #0000dd; font-weight: bold;">6</span> <span style="color: black; font-weight: bold;">is</span> the smallest perfect number <span style="color: #333333;">.</span>
<span style="color: #0000dd; font-weight: bold;">7</span> <span style="color: black; font-weight: bold;">is</span> the smallest number of sides of a regular polygon that <span style="color: black; font-weight: bold;">is</span> <span style="color: black; font-weight: bold;">not</span> constructible by straightedge <span style="color: black; font-weight: bold;">and</span> compass<span style="color: #333333;">.</span>
<span style="color: #0000dd; font-weight: bold;">8</span> <span style="color: black; font-weight: bold;">is</span> the largest cube <span style="color: black; font-weight: bold;">in</span> the Fibonacci sequence <span style="color: #333333;">.</span>
<span style="color: #0000dd; font-weight: bold;">9</span> <span style="color: black; font-weight: bold;">is</span> the maximum number of cubes that are needed to <span style="color: #007020;">sum</span> to <span style="color: #007020;">any</span> positive integer <span style="color: #333333;">.</span>
<span style="color: #0000dd; font-weight: bold;">10</span> <span style="color: black; font-weight: bold;">is</span> the base of our number system<span style="color: #333333;">.</span>
<span style="color: #0000dd; font-weight: bold;">11</span> <span style="color: black; font-weight: bold;">is</span> the largest known multiplicative persistence <span style="color: #333333;">.</span>
<span style="color: #0000dd; font-weight: bold;">12</span> <span style="color: black; font-weight: bold;">is</span> the smallest abundant number <span style="color: #333333;">.</span>
<span style="color: #0000dd; font-weight: bold;">13</span> <span style="color: black; font-weight: bold;">is</span> the number of Archimedean solids <span style="color: #333333;">.</span>
<span style="color: #0000dd; font-weight: bold;">14</span> <span style="color: black; font-weight: bold;">is</span> the smallest even number n <span style="color: #008800; font-weight: bold;">with</span> no solutions to <span style="background-color: #ffaaaa; color: red;">φ</span> (m) <span style="color: #333333;">=</span> n<span style="color: #333333;">.</span>
<span style="color: #0000dd; font-weight: bold;">15</span> <span style="color: black; font-weight: bold;">is</span> the smallest composite number n <span style="color: #008800; font-weight: bold;">with</span> the <span style="color: #007020;">property</span> that there <span style="color: black; font-weight: bold;">is</span> only one group of order n<span style="color: #333333;">.</span>
found sup tag
found sup tag
<span style="color: #0000dd; font-weight: bold;">16</span> <span style="color: black; font-weight: bold;">is</span> the only number of the form x <span style="color: #333333;">=</span> y <span style="color: #008800; font-weight: bold;">with</span> x <span style="color: black; font-weight: bold;">and</span> y being different integers <span style="color: #333333;">.</span>
<span style="color: #0000dd; font-weight: bold;">17</span> <span style="color: black; font-weight: bold;">is</span> the number of wallpaper groups <span style="color: #333333;">.</span>
<span style="color: #0000dd; font-weight: bold;">18</span> <span style="color: black; font-weight: bold;">is</span> the only positive number that <span style="color: black; font-weight: bold;">is</span> twice the <span style="color: #007020;">sum</span> of its digits<span style="color: #333333;">.</span>
found sup tag
<span style="color: #0000dd; font-weight: bold;">19</span> <span style="color: black; font-weight: bold;">is</span> the maximum number of <span style="color: #0000dd; font-weight: bold;">4</span> powers needed to <span style="color: #007020;">sum</span> to <span style="color: #007020;">any</span> number<span style="color: #333333;">.</span>
<span style="color: #0000dd; font-weight: bold;">20</span> <span style="color: black; font-weight: bold;">is</span> the number of rooted trees <span style="color: #008800; font-weight: bold;">with</span> <span style="color: #0000dd; font-weight: bold;">6</span> vertices<span style="color: #333333;">.</span>
<span style="color: #0000dd; font-weight: bold;">21</span> <span style="color: black; font-weight: bold;">is</span> the smallest number of distinct squares needed to tile a square <span style="color: #333333;">.</span>
<span style="color: #0000dd; font-weight: bold;">22</span> <span style="color: black; font-weight: bold;">is</span> the number of partitions of <span style="color: #6600ee; font-weight: bold;">8.</span>
</pre>
</div>
<br />
Since I collected the unknown tags, I could see what they were:
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-saJlm6y1SXE/Wfy88uAlKJI/AAAAAAAABPU/05kmfcs3luIg2NeS3S8z8n-Eo07gF8IEgCLcBGAs/s1600/tags.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="107" data-original-width="1600" height="42" src="https://2.bp.blogspot.com/-saJlm6y1SXE/Wfy88uAlKJI/AAAAAAAABPU/05kmfcs3luIg2NeS3S8z8n-Eo07gF8IEgCLcBGAs/s640/tags.png" width="640" /></a></div>
<br />
Here is an example of a superscript being used:<br />
<span style="font-family: monospace; white-space: pre-wrap;">
</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><font <span class="html-attribute-name">size</span>=<span class="html-attribute-value">+3</span> <span class="html-attribute-name">color</span>=<span class="html-attribute-value">FF6699</span>></span><span style="font-family: monospace; white-space: pre-wrap;">16</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"></font></span><span style="font-family: monospace; white-space: pre-wrap;"> is the only number of the form x</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><sup></span><span style="font-family: monospace; white-space: pre-wrap;">y</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"></sup></span><span style="font-family: monospace; white-space: pre-wrap;"> = y</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><sup></span><span style="font-family: monospace; white-space: pre-wrap;">x</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"></sup></span><span style="font-family: monospace; white-space: pre-wrap;"> with x and y being different </span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><a <span class="html-attribute-name">href</span>="<a class="html-attribute-value html-external-link" href="http://mathworld.wolfram.com/Integer.html" target="_blank">http://mathworld.wolfram.com/Integer.html</a>"></span><span style="font-family: monospace; white-space: pre-wrap;">integers</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"></a></span><span style="font-family: monospace; white-space: pre-wrap;">.</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><br></span><br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-bWgv_ds3hxs/Wfy-WnCFPpI/AAAAAAAABPg/JBbA27Nn69UlwTPfN-H8q7lS1lys-xzSgCLcBGAs/s1600/tags.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="76" data-original-width="1172" height="25" src="https://1.bp.blogspot.com/-bWgv_ds3hxs/Wfy-WnCFPpI/AAAAAAAABPg/JBbA27Nn69UlwTPfN-H8q7lS1lys-xzSgCLcBGAs/s400/tags.png" width="400" /></a></div>
<span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><br /></span>
<br />
<br />
<br />
Now isn't that interesting, out of all the numbers that this is the only one?<br />
<br />
Here is how the subscript is being used:<br />
<br />
<span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><font <span class="html-attribute-name">size</span>=<span class="html-attribute-value">+3</span> <span class="html-attribute-name">color</span>=<span class="html-attribute-value">brown</span>></span><span style="font-family: monospace; white-space: pre-wrap;">126</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"></font></span><span style="font-family: monospace; white-space: pre-wrap;"> = </span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><sub></span><span style="font-family: monospace; white-space: pre-wrap;">9</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"></sub></span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><a <span class="html-attribute-name">href</span>="<a class="html-attribute-value html-external-link" href="http://mathworld.wolfram.com/Combination.html" target="_blank">http://mathworld.wolfram.com/Combination.html</a>"></span><span style="font-family: monospace; white-space: pre-wrap;">C</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"></a></span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><sub></span><span style="font-family: monospace; white-space: pre-wrap;">4</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"></sub></span><span style="font-family: monospace; white-space: pre-wrap;">.</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><br></span><br />
<span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-YN53som9PPI/Wfy_X94PYSI/AAAAAAAABPs/B5BSjZdw-g8-U1O2FojOmM-SVwNHrpgKACLcBGAs/s1600/sub.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="78" data-original-width="222" src="https://1.bp.blogspot.com/-YN53som9PPI/Wfy_X94PYSI/AAAAAAAABPs/B5BSjZdw-g8-U1O2FojOmM-SVwNHrpgKACLcBGAs/s1600/sub.png" /></a></div>
<span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><br /></span>
<br />
<br />
<br />
<br />
<br />
With this insight, I augmented the superscripts with the ^ symbol, and left the subscripts as is.<br />
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #008800; font-weight: bold;">def</span> <span style="color: #0066bb; font-weight: bold;">get_text_to_eol</span>(font_section):
text_parts <span style="color: #333333;">=</span> []
section <span style="color: #333333;">=</span> font_section<span style="color: #333333;">.</span>next_sibling
<span style="color: #008800; font-weight: bold;">while</span> section<span style="color: #333333;">.</span>name <span style="color: #333333;">!=</span> <span style="background-color: #fff0f0;">'br'</span>:
<span style="color: #008800; font-weight: bold;">if</span> section<span style="color: #333333;">.</span>name <span style="color: #333333;">==</span> <span style="background-color: #fff0f0;">'a'</span>:
text_parts<span style="color: #333333;">.</span>append(section<span style="color: #333333;">.</span>string)
<span style="color: #008800; font-weight: bold;">elif</span> section<span style="color: #333333;">.</span>name <span style="color: black; font-weight: bold;">is</span> <span style="color: #007020;">None</span>:
text_parts<span style="color: #333333;">.</span>append(<span style="color: #007020;">str</span>(section))
<span style="color: #008800; font-weight: bold;">else</span>:
<span style="color: #008800; font-weight: bold;">if</span> section<span style="color: #333333;">.</span>name <span style="color: #333333;">==</span> <span style="background-color: #fff0f0;">'sup'</span>:
text_parts<span style="color: #333333;">.</span>append(<span style="background-color: #fff0f0;">'^'</span>)
text_parts<span style="color: #333333;">.</span>append(section<span style="color: #333333;">.</span>string)
section <span style="color: #333333;">=</span> section<span style="color: #333333;">.</span>next_sibling
<span style="color: #008800; font-weight: bold;">return</span> <span style="background-color: #fff0f0;">' '</span><span style="color: #333333;">.</span>join(text_parts)
</pre>
</div>
<br />
And I again digressed on a merry tangent where people were using <a href="https://www.buzzfeed.com/jwherrman/9-tweets-that-break-twitter?utm_term=.bqN6oEQpL#.pr1Gk1lbA">non-ascii</a> characters to tweet <a href="https://lingojam.com/TwitterFonts">subscripts</a> and superscripts.<br />
<br />
However, somewhat surprisingly, the sibling list did not always end in a <br>. I hit a None for a section for integer 248 with the html:<br />
<br />
<span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><font <span class="html-attribute-name">size</span>=<span class="html-attribute-value">+3</span> <span class="html-attribute-name">color</span>=<span class="html-attribute-value">006600</span>></span><span style="font-family: monospace; white-space: pre-wrap;">248</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"></font></span><span style="font-family: monospace; white-space: pre-wrap;"> is the smallest number n>1 for which the </span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><a <span class="html-attribute-name">href</span>="<a class="html-attribute-value html-external-link" href="http://mathworld.wolfram.com/ArithmeticMean.html" target="_blank">http://mathworld.wolfram.com/ArithmeticMean.html</a>"></span><span style="font-family: monospace; white-space: pre-wrap;">arithmetic</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"></a></span><span style="font-family: monospace; white-space: pre-wrap;">, </span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><a <span class="html-attribute-name">href</span>="<a class="html-attribute-value html-external-link" href="http://mathworld.wolfram.com/GeometricMean.html" target="_blank">http://mathworld.wolfram.com/GeometricMean.html</a>"></span><span style="font-family: monospace; white-space: pre-wrap;">geometric</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"></a></span><span style="font-family: monospace; white-space: pre-wrap;">, and </span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><a <span class="html-attribute-name">href</span>="<a class="html-attribute-value html-external-link" href="http://mathworld.wolfram.com/HarmonicMean.html" target="_blank">http://mathworld.wolfram.com/HarmonicMean.html</a>"></span><span style="font-family: monospace; white-space: pre-wrap;">harmonic means</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><a/></span><span style="font-family: monospace; white-space: pre-wrap;"> of </span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><a <span class="html-attribute-name">href</span>="<a class="html-attribute-value html-external-link" href="http://mathworld.wolfram.com/TotientFunction.html" target="_blank">http://mathworld.wolfram.com/TotientFunction.html</a>"></span><span style="font-family: monospace; white-space: pre-wrap;">&phi;</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"></a></span><span style="font-family: monospace; white-space: pre-wrap;">(n) and </span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><a <span class="html-attribute-name">href</span>="<a class="html-attribute-value html-external-link" href="http://mathworld.wolfram.com/DivisorFunction.html" target="_blank">http://mathworld.wolfram.com/DivisorFunction.html</a>"></span><span style="font-family: monospace; white-space: pre-wrap;">&sigma;</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"></a></span><span style="font-family: monospace; white-space: pre-wrap;">(n) are all </span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><a <span class="html-attribute-name">href</span>="<a class="html-attribute-value html-external-link" href="http://mathworld.wolfram.com/Integer.html" target="_blank">http://mathworld.wolfram.com/Integer.html</a>"></span><span style="font-family: monospace; white-space: pre-wrap;">integers</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"></a></span><span style="font-family: monospace; white-space: pre-wrap;">.</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><br></span><br />
<span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><br /></span>
Can you spot the problem, it is subtle?<br />
<br />
Notice that <span style="font-family: monospace; white-space: pre-wrap;">harmonic means</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><a/></span> is not the correct encoding. Beautiful soup replaces this dangling tag with a beautiful pair <span style="background-color: white; font-size: 14px; white-space: pre-wrap;"><a></a></span>:<br />
<br />
<pre style="background-color: white; border-radius: 0px; border: 0px; box-sizing: border-box; font-size: 14px; line-height: inherit; overflow: auto; padding: 0px; vertical-align: baseline; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><a href="http://mathworld.wolfram.com/HarmonicMean.html">harmonic means<a></a> of <a href="http://mathworld.wolfram.com/TotientFunction.html">φ</a>(n) and <a href="http://mathworld.wolfram.com/DivisorFunction.html">σ</a>(n) are all <a href="http://mathworld.wolfram.com/Integer.html">integers</a>.<br/></a></pre>
<div>
<br /></div>
This is all very nice, except that we were relying on a <br> tag to be an eventual sibling, and Beautiful soup is on a soupy wake trying to find the matching </a> to the tag it started with, finally <b><span style="color: purple;">finding it at</span>:</b><br />
<br />
<span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><font <span class="html-attribute-name">size</span>=<span class="html-attribute-value">+3</span> <span class="html-attribute-name">color</span>=<span class="html-attribute-value">FF6699</span>></span><span style="font-family: monospace; white-space: pre-wrap;">1351</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"></font></span><span style="font-family: monospace; white-space: pre-wrap;"> has the property that </span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><a <span class="html-attribute-name">href</span>="<a class="html-attribute-value html-external-link" href="http://mathworld.wolfram.com/e.html" target="_blank">http://mathworld.wolfram.com/e.html</a>"></span><span style="font-family: monospace; white-space: pre-wrap;">e</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"></a></span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><sup></span><span style="font-family: monospace; white-space: pre-wrap;">1351</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><span style="color: purple;"><b></a></b></span></span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"></sup></span><span style="font-family: monospace; white-space: pre-wrap;"> is within .0009 of an </span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><a <span class="html-attribute-name">href</span>="<a class="html-attribute-value html-external-link" href="http://mathworld.wolfram.com/Integer.html" target="_blank">http://mathworld.wolfram.com/Integer.html</a>"></span><span style="font-family: monospace; white-space: pre-wrap;">integer</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"></a></span><span style="font-family: monospace; white-space: pre-wrap;">.</span><span class="html-tag" style="font-family: monospace; white-space: pre-wrap;"><br></span><br />
<br />
We are getting all the integers from 248 through 1351 in one unbroken block.<br />
<br />
Is there then an easier way to solve this problem? It's tempting to think regular expressions when it comes to html parsing issues of this sort. What if we use a regular expression to split apart the sections containing the integer and phrase? After all, using a top down parser snagged on a mismatched tag, but maybe a regular expression can give us a better behaved set of html tags which we can then parse individually with Beautiful Soup.<br />
<br />
We can get the html lines with a reg exp split.<br />
<br />
Since we decided to forego the URLs, we could construct a BeautifulSoup instance for each line, and call the get_text() method to strip all tags.<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #008800; font-weight: bold;">import</span> <span style="color: #0e84b5; font-weight: bold;">re</span>
lines <span style="color: #333333;">=</span> re<span style="color: #333333;">.</span>split(<span style="background-color: #fff0f0;">"<br>[</span><span style="background-color: #fff0f0; color: #666666; font-weight: bold;">\r\n</span><span style="background-color: #fff0f0;">\s]*"</span>, html)
list_lines <span style="color: #333333;">=</span> <span style="color: #007020;">list</span>(<span style="color: #007020;">filter</span>(<span style="color: #008800; font-weight: bold;">lambda</span> x: x <span style="color: black; font-weight: bold;">is</span> <span style="color: black; font-weight: bold;">not</span> <span style="color: #007020;">None</span>, [re<span style="color: #333333;">.</span>search(<span style="background-color: #fff0f0;">r"<font size=\+3 .*"</span>, line) <span style="color: #008800; font-weight: bold;">for</span> line <span style="color: black; font-weight: bold;">in</span> lines]))
text_lines <span style="color: #333333;">=</span> [BeautifulSoup(l<span style="color: #333333;">.</span>group(<span style="color: #0000dd; font-weight: bold;">0</span>), <span style="background-color: #fff0f0;">'html.parser'</span>)<span style="color: #333333;">.</span>get_text() <span style="color: #008800; font-weight: bold;">for</span> l <span style="color: black; font-weight: bold;">in</span> list_lines]
</pre>
</div>
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">[<span style="background-color: #fff0f0;">'0 is the additive identity.'</span>,
<span style="background-color: #fff0f0;">'1 is the multiplicative identity.'</span>,
<span style="background-color: #fff0f0;">'2 is the only even prime.'</span>,
<span style="background-color: #fff0f0;">'3 is the number of spatial dimensions we live in.'</span>,
<span style="background-color: #fff0f0;">'4 is the smallest number of colors sufficient to color all planar maps.'</span>,
<span style="background-color: #fff0f0;">'5 is the number of Platonic solids.'</span>,
<span style="background-color: #fff0f0;">'6 is the smallest perfect number.'</span>,
<span style="background-color: #fff0f0;">'7 is the smallest number of sides of a regular polygon that is not constructible by straightedge and compass.'</span>,
<span style="background-color: #fff0f0;">'8 is the largest cube in the Fibonacci sequence.'</span>,
<span style="background-color: #fff0f0;">'9 is the maximum number of cubes that are needed to sum to any positive integer.'</span>,
<span style="background-color: #fff0f0;">'10 is the base of our number system.'</span>,
<span style="background-color: #fff0f0;">'11 is the largest known multiplicative persistence.'</span>,
<span style="background-color: #fff0f0;">'12 is the smallest abundant number.'</span>,
<span style="background-color: #fff0f0;">'13 is the number of Archimedean solids.'</span>,
<span style="background-color: #fff0f0;">'14 is the smallest even number n with no solutions to φ(m) = n.'</span>,
<span style="background-color: #fff0f0;">'15 is the smallest composite number n with the property that there is only one group of order n.'</span>,
<span style="background-color: #fff0f0;">'16 is the only number of the form xy = yx with x and y being different integers.'</span>,
<span style="background-color: #fff0f0;">'17 is the number of wallpaper groups.'</span>,
</pre>
</div>
<br />
But now we are not identifying the superscripts, subscripts, as you can see from the output for integer 16.<br />
<br />
What we should do is to then use the regular expression to get the lines, apply the parser for each line, then use the function we wrote earlier to get the text within each line. Now since the parser can't go over a <br>, it just might result in a better extraction of phrases.<br />
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #008800; font-weight: bold;">import</span> <span style="color: #0e84b5; font-weight: bold;">re</span>
lines <span style="color: #333333;">=</span> re<span style="color: #333333;">.</span>split(<span style="background-color: #fff0f0;">"<br>[</span><span style="background-color: #fff0f0; color: #666666; font-weight: bold;">\r\n</span><span style="background-color: #fff0f0;">\s]*"</span>, html)
list_lines <span style="color: #333333;">=</span> <span style="color: #007020;">list</span>(<span style="color: #007020;">filter</span>(<span style="color: #008800; font-weight: bold;">lambda</span> x: x <span style="color: black; font-weight: bold;">is</span> <span style="color: black; font-weight: bold;">not</span> <span style="color: #007020;">None</span>, [re<span style="color: #333333;">.</span>search(<span style="background-color: #fff0f0;">r"<font size=\+3 .*"</span>, line) <span style="color: #008800; font-weight: bold;">for</span> line <span style="color: black; font-weight: bold;">in</span> lines]))
soups <span style="color: #333333;">=</span> [BeautifulSoup(l<span style="color: #333333;">.</span>group(<span style="color: #0000dd; font-weight: bold;">0</span>), <span style="background-color: #fff0f0;">'html.parser'</span>)<span style="color: #333333;">.</span>font <span style="color: #008800; font-weight: bold;">for</span> l <span style="color: black; font-weight: bold;">in</span> list_lines]
np_list <span style="color: #333333;">=</span> [(<span style="color: #007020;">int</span>(s<span style="color: #333333;">.</span>get_text()), get_text_to_eol(s)) <span style="color: #008800; font-weight: bold;">for</span> s <span style="color: black; font-weight: bold;">in</span> soups]
</pre>
</div>
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">(<span style="color: #0000dd; font-weight: bold;">7</span>,
<span style="background-color: #fff0f0;">' is the smallest number of sides of a regular polygon that is not constructible by straightedge and compass.'</span>),
(<span style="color: #0000dd; font-weight: bold;">8</span>, <span style="background-color: #fff0f0;">' is the largest cube in the Fibonacci sequence .'</span>),
(<span style="color: #0000dd; font-weight: bold;">9</span>,
<span style="background-color: #fff0f0;">' is the maximum number of cubes that are needed to sum to any positive integer .'</span>),
(<span style="color: #0000dd; font-weight: bold;">10</span>, <span style="background-color: #fff0f0;">' is the base of our number system.'</span>),
(<span style="color: #0000dd; font-weight: bold;">11</span>, <span style="background-color: #fff0f0;">' is the largest known multiplicative persistence .'</span>),
(<span style="color: #0000dd; font-weight: bold;">12</span>, <span style="background-color: #fff0f0;">' is the smallest abundant number .'</span>),
(<span style="color: #0000dd; font-weight: bold;">13</span>, <span style="background-color: #fff0f0;">' is the number of Archimedean solids .'</span>),
(<span style="color: #0000dd; font-weight: bold;">14</span>, <span style="background-color: #fff0f0;">' is the smallest even number n with no solutions to φ (m) = n.'</span>),
(<span style="color: #0000dd; font-weight: bold;">15</span>,
<span style="background-color: #fff0f0;">' is the smallest composite number n with the property that there is only one group of order n.'</span>),
(<span style="color: #0000dd; font-weight: bold;">16</span>,
<span style="background-color: #fff0f0;">' is the only number of the form x ^ y = y ^ x with x and y being different integers .'</span>),
(<span style="color: #0000dd; font-weight: bold;">17</span>, <span style="background-color: #fff0f0;">' is the number of wallpaper groups .'</span>),
</pre>
</div>
<br />
Now we convert this list of pairs to a dictionary, so we can quickly look up the integer:<br />
<br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #007020;">hash</span> <span style="color: #333333;">=</span> <span style="color: #007020;">dict</span>(np_list)
</pre>
</div>
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">{<span style="color: #0000dd; font-weight: bold;">0</span>: <span style="background-color: #fff0f0;">' is the additive identity .'</span>,
<span style="color: #0000dd; font-weight: bold;">1</span>: <span style="background-color: #fff0f0;">' is the multiplicative identity .'</span>,
<span style="color: #0000dd; font-weight: bold;">2</span>: <span style="background-color: #fff0f0;">' is the only even prime .'</span>,
<span style="color: #0000dd; font-weight: bold;">3</span>: <span style="background-color: #fff0f0;">' is the number of spatial dimensions we live in.'</span>,
<span style="color: #0000dd; font-weight: bold;">4</span>: <span style="background-color: #fff0f0;">' is the smallest number of colors sufficient to color all planar maps.'</span>,
<span style="color: #0000dd; font-weight: bold;">5</span>: <span style="background-color: #fff0f0;">' is the number of Platonic solids .'</span>,
<span style="color: #0000dd; font-weight: bold;">6</span>: <span style="background-color: #fff0f0;">' is the smallest perfect number .'</span>,
<span style="color: #0000dd; font-weight: bold;">7</span>: <span style="background-color: #fff0f0;">' is the smallest number of sides of a regular polygon that is not constructible by straightedge and compass.'</span>,
<span style="color: #0000dd; font-weight: bold;">8</span>: <span style="background-color: #fff0f0;">' is the largest cube in the Fibonacci sequence .'</span>,
<span style="color: #0000dd; font-weight: bold;">9</span>: <span style="background-color: #fff0f0;">' is the maximum number of cubes that are needed to sum to any positive integer .'</span>,
<span style="color: #0000dd; font-weight: bold;">10</span>: <span style="background-color: #fff0f0;">' is the base of our number system.'</span>,
<span style="color: #0000dd; font-weight: bold;">11</span>: <span style="background-color: #fff0f0;">' is the largest known multiplicative persistence .'</span>,
<span style="color: #0000dd; font-weight: bold;">12</span>: <span style="background-color: #fff0f0;">' is the smallest abundant number .'</span>,
<span style="color: #0000dd; font-weight: bold;">13</span>: <span style="background-color: #fff0f0;">' is the number of Archimedean solids .'</span>,
<span style="color: #0000dd; font-weight: bold;">14</span>: <span style="background-color: #fff0f0;">' is the smallest even number n with no solutions to φ (m) = n.'</span>,
<span style="color: #0000dd; font-weight: bold;">15</span>: <span style="background-color: #fff0f0;">' is the smallest composite number n with the property that there is only one group of order n.'</span>,
<span style="color: #0000dd; font-weight: bold;">16</span>: <span style="background-color: #fff0f0;">' is the only number of the form x ^ y = y ^ x with x and y being different integers .'</span>,
</pre>
</div>
<br />
The get_text_to_eol() was modified to handle hitting the end of the sibling list without hitting a <br>. Also, we keep all strings unicode up until they need to be output, at which point a conversion to utf-8 is done.<br />
<br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #008800; font-weight: bold;">def</span> <span style="color: #0066bb; font-weight: bold;">get_text_to_eol</span>(font_section):
text_parts <span style="color: #333333;">=</span> []
section <span style="color: #333333;">=</span> font_section<span style="color: #333333;">.</span>next_sibling
<span style="color: #008800; font-weight: bold;">while</span> section <span style="color: black; font-weight: bold;">is</span> <span style="color: black; font-weight: bold;">not</span> <span style="color: #007020;">None</span>:
<span style="color: #008800; font-weight: bold;">if</span> section<span style="color: #333333;">.</span>name <span style="color: #333333;">==</span> <span style="background-color: #fff0f0;">'a'</span>:
text_parts<span style="color: #333333;">.</span>append(section<span style="color: #333333;">.</span>get_text())
<span style="color: #008800; font-weight: bold;">elif</span> section<span style="color: #333333;">.</span>name <span style="color: black; font-weight: bold;">is</span> <span style="color: #007020;">None</span>:
text_parts<span style="color: #333333;">.</span>append(<span style="color: #007020;">unicode</span>(section))
<span style="color: #008800; font-weight: bold;">else</span>:
<span style="color: #008800; font-weight: bold;">if</span> section<span style="color: #333333;">.</span>name <span style="color: #333333;">==</span> <span style="background-color: #fff0f0;">'sup'</span>:
text_parts<span style="color: #333333;">.</span>append(<span style="background-color: #fff0f0;">'^'</span>)
text_parts<span style="color: #333333;">.</span>append(section<span style="color: #333333;">.</span>get_text())
section <span style="color: #333333;">=</span> section<span style="color: #333333;">.</span>next_sibling
<span style="color: #008800; font-weight: bold;">return</span> <span style="background-color: #fff0f0;">' '</span><span style="color: #333333;">.</span>join(text_parts)
</pre>
</div>
<br />
I think this will do for our purposes. In the next post I will show how this was used along with the bike counter uploads to tweet early morning updates to twitter.<br />
<br />
The full source code, along with the twitter updates can be found <a href="https://github.com/thushw/fremont">here</a>.<br />
<br />Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-11265228.post-54667924169200286922017-08-12T20:10:00.000-07:002017-08-22T15:50:32.091-07:00A Trie<!-- HTML generated using hilite.me --><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-CCm_S4nrPQc/WZy1HC9a-dI/AAAAAAAABN0/uWlPW9iXeE8VKJeZZLM4pmrmcITyuosvgCLcBGAs/s1600/ds.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="225" data-original-width="225" height="200" src="https://3.bp.blogspot.com/-CCm_S4nrPQc/WZy1HC9a-dI/AAAAAAAABN0/uWlPW9iXeE8VKJeZZLM4pmrmcITyuosvgCLcBGAs/s200/ds.png" width="200" /></a></div>
This is a trie that uses a sentinel node to denote the end of a word. This is more space efficient than having to flag each node as to whether it denotes an end of a word. To quickly find the number of prefix matches, it stores the prefix count in the node.
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #008800; font-weight: bold;">class</span> <span style="color: #bb0066; font-weight: bold;">Trie</span> <span style="color: #333333;">{</span>
<span style="color: #333399; font-weight: bold;">char</span> ch<span style="color: #333333;">;</span>
<span style="color: #333399; font-weight: bold;">int</span> count <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span><span style="color: #333333;">;</span>
Map<span style="color: #333333;"><</span>Character<span style="color: #333333;">,</span> Trie<span style="color: #333333;">></span> list <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">new</span> HashMap<span style="color: #333333;"><</span>Character<span style="color: #333333;">,</span> Trie<span style="color: #333333;">>();</span>
<span style="color: #008800; font-weight: bold;">public</span> <span style="color: #0066bb; font-weight: bold;">Trie</span><span style="color: #333333;">(</span><span style="color: #333399; font-weight: bold;">char</span> ch<span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
<span style="color: #008800; font-weight: bold;">this</span><span style="color: #333333;">.</span><span style="color: #0000cc;">ch</span> <span style="color: #333333;">=</span> ch<span style="color: #333333;">;</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">public</span> Trie <span style="color: #0066bb; font-weight: bold;">add</span><span style="color: #333333;">(</span><span style="color: #333399; font-weight: bold;">char</span> ch<span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
Trie node <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">this</span><span style="color: #333333;">.</span><span style="color: #0000cc;">list</span><span style="color: #333333;">.</span><span style="color: #0000cc;">get</span><span style="color: #333333;">(</span>ch<span style="color: #333333;">);</span>
<span style="color: #008800; font-weight: bold;">if</span> <span style="color: #333333;">(</span>node <span style="color: #333333;">==</span> <span style="color: #008800; font-weight: bold;">null</span><span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
Trie newNode <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">new</span> Trie<span style="color: #333333;">(</span>ch<span style="color: #333333;">);</span>
<span style="color: #008800; font-weight: bold;">this</span><span style="color: #333333;">.</span><span style="color: #0000cc;">list</span><span style="color: #333333;">.</span><span style="color: #0000cc;">put</span><span style="color: #333333;">(</span>ch<span style="color: #333333;">,</span> newNode<span style="color: #333333;">);</span>
node <span style="color: #333333;">=</span> newNode<span style="color: #333333;">;</span>
<span style="color: #333333;">}</span></pre>
<pre style="line-height: 125%; margin: 0;"><span style="color: #333333;">
</span></pre>
<pre style="line-height: 125%; margin: 0;"><pre style="line-height: 16.25px;"><span style="color: #333333;"> //adding the count to the current node is preferable</span></pre>
<pre style="line-height: 16.25px;"><span style="color: #333333;"> //to adding to </span>the node that matches the character.</pre>
<pre style="line-height: 16.25px;"> //This way, we won't add to the sentinel node</pre>
<pre style="line-height: 16.25px;"> //and we add only in one place.</pre>
<span style="color: #008800; font-weight: bold;">this</span><span style="color: #333333;">.</span><span style="color: #0000cc;">count</span><span style="color: #333333;">++;</span>
<span style="color: #008800; font-weight: bold;">return</span> node<span style="color: #333333;">;</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">public</span> <span style="color: #333399; font-weight: bold;">int</span> <span style="color: #0066bb; font-weight: bold;">size</span><span style="color: #333333;">()</span> <span style="color: #333333;">{</span>
<span style="color: #008800; font-weight: bold;">return</span> <span style="color: #008800; font-weight: bold;">this</span><span style="color: #333333;">.</span><span style="color: #0000cc;">count</span><span style="color: #333333;">;</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">private</span> Trie <span style="color: #0066bb; font-weight: bold;">findChar</span><span style="color: #333333;">(</span><span style="color: #333399; font-weight: bold;">char</span> ch<span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
<span style="color: #008800; font-weight: bold;">return</span> <span style="color: #008800; font-weight: bold;">this</span><span style="color: #333333;">.</span><span style="color: #0000cc;">list</span><span style="color: #333333;">.</span><span style="color: #0000cc;">get</span><span style="color: #333333;">(</span>ch<span style="color: #333333;">);</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">public</span> <span style="color: #333399; font-weight: bold;">boolean</span> <span style="color: #0066bb; font-weight: bold;">findWord</span><span style="color: #333333;">(</span>String word<span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
Trie node <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">this</span><span style="color: #333333;">;</span>
<span style="color: #008800; font-weight: bold;">for</span> <span style="color: #333333;">(</span><span style="color: #333399; font-weight: bold;">char</span> <span style="color: #997700; font-weight: bold;">ch:</span> word<span style="color: #333333;">.</span><span style="color: #0000cc;">toCharArray</span><span style="color: #333333;">())</span> <span style="color: #333333;">{</span>
node <span style="color: #333333;">=</span> node<span style="color: #333333;">.</span><span style="color: #0000cc;">findChar</span><span style="color: #333333;">(</span>ch<span style="color: #333333;">);</span>
<span style="color: #008800; font-weight: bold;">if</span> <span style="color: #333333;">(</span>node <span style="color: #333333;">==</span> <span style="color: #008800; font-weight: bold;">null</span><span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
<span style="color: #008800; font-weight: bold;">return</span> <span style="color: #008800; font-weight: bold;">false</span><span style="color: #333333;">;</span>
<span style="color: #333333;">}</span>
<span style="color: #333333;">}</span></pre>
<pre style="line-height: 125%; margin: 0;"> //we may have found a prefix, make sure it is a word</pre>
<pre style="line-height: 125%; margin: 0;"> //if it's a word, the list must have the sentinel.
<span style="color: #008800; font-weight: bold;">return</span> node<span style="color: #333333;">.</span><span style="color: #0000cc;">list</span><span style="color: #333333;">.</span><span style="color: #0000cc;">get</span><span style="color: #333333;">((</span><span style="color: #333399; font-weight: bold;">char</span><span style="color: #333333;">)</span><span style="color: #0000dd; font-weight: bold;">0</span><span style="color: #333333;">)</span> <span style="color: #333333;">!=</span> <span style="color: #008800; font-weight: bold;">null</span><span style="color: #333333;">;</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">public</span> <span style="color: #333399; font-weight: bold;">int</span> <span style="color: #0066bb; font-weight: bold;">findPartial</span><span style="color: #333333;">(</span>String prefix<span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
Trie node <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">this</span><span style="color: #333333;">;</span>
<span style="color: #008800; font-weight: bold;">for</span> <span style="color: #333333;">(</span><span style="color: #333399; font-weight: bold;">char</span> ch <span style="color: #333333;">:</span> prefix<span style="color: #333333;">.</span><span style="color: #0000cc;">toCharArray</span><span style="color: #333333;">())</span> <span style="color: #333333;">{</span>
node <span style="color: #333333;">=</span> node<span style="color: #333333;">.</span><span style="color: #0000cc;">list</span><span style="color: #333333;">.</span><span style="color: #0000cc;">get</span><span style="color: #333333;">(</span>ch<span style="color: #333333;">);</span>
<span style="color: #008800; font-weight: bold;">if</span> <span style="color: #333333;">(</span>node <span style="color: #333333;">==</span> <span style="color: #008800; font-weight: bold;">null</span><span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
<span style="color: #008800; font-weight: bold;">return</span> <span style="color: #0000dd; font-weight: bold;">0</span><span style="color: #333333;">;</span>
<span style="color: #333333;">}</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">return</span> node<span style="color: #333333;">.</span><span style="color: #0000cc;">size</span><span style="color: #333333;">();</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">public</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">add</span><span style="color: #333333;">(</span>String s<span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
Trie node <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">this</span><span style="color: #333333;">;</span>
<span style="color: #008800; font-weight: bold;">for</span> <span style="color: #333333;">(</span><span style="color: #333399; font-weight: bold;">char</span> ch <span style="color: #333333;">:</span> s<span style="color: #333333;">.</span><span style="color: #0000cc;">toCharArray</span><span style="color: #333333;">())</span> <span style="color: #333333;">{</span>
node <span style="color: #333333;">=</span> node<span style="color: #333333;">.</span><span style="color: #0000cc;">add</span><span style="color: #333333;">(</span>ch<span style="color: #333333;">);</span>
<span style="color: #333333;">}</span></pre>
<pre style="line-height: 125%; margin: 0;"> //add the sentinel to mark the end of the word.
node<span style="color: #333333;">.</span><span style="color: #0000cc;">add</span><span style="color: #333333;">((</span><span style="color: #333399; font-weight: bold;">char</span><span style="color: #333333;">)</span><span style="color: #0000dd; font-weight: bold;">0</span><span style="color: #333333;">);</span>
<span style="color: #333333;">}</span>
<span style="color: #333333;">}</span></pre>
</div>
<br />
Now it is possible to reduce the space taken by the trie further by using an array instead of the map. Knowing that we need to use only lower case letters, we can use the charater before 'a' as the sentinel, so that the array length is set to 27.<br />
<br />
Another space optimization comes about by using a single word (32 bits) to store both the character and the prefix count. Java uses two bytes for the char type, and we could do with one byte. But that still uses 5 bytes per Trie node, but we don't need the 2 billion range possible with 32 bits to represent the count of all prefixes for any English substring.<br />
<br />
The prefix count is highest on the root node, as all words have the head node character as the prefix. So the highest prefix count is the number of words in the dictionary. This is generally never more than 250, 000. We can safely use 24 bits which can represent 8 million as a signed integer.<br />
<br />
So we can combine the character and the prefix count to a single word.<br />
<br />
Is there anything else we could do? Yes - we could read all the words into our Trie and trim the list on each Trie node. This results from the observation that we rarely use all the slots in our list - Especially as the trie spans out, there are fewer number of new words. Thus we could find the last used index on the list, and create a new shorter list.<br />
<br />
Doing all of these drops the size of the trie from ~ 228M to ~ 68M.<br />
<br />
Here is an implementation.<br />
<br />
I store a random word list in pastebin for testing - there is code here that uses this, as well as pulling a dictionary of lower case words. If you use this, you will need to make sure the dictionary you substitute has only lower case words, so some pre-processing might be necessary - in particular, you are likely to find the hyphen (-) in some word which you will need to remove.<br />
<br />
Last but not least, the memory stats don't give an idea of the space saving due to garbage collector not being deterministic. I use the sizeInBytes() to recursively calculate the memory foot print of the Trie.<br />
<br />
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<table><tbody>
<tr><td><pre style="line-height: 125%; margin: 0;"> 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272</pre>
</td><td><pre style="line-height: 125%; margin: 0;"><span style="color: #008800; font-weight: bold;">public</span> <span style="color: #008800; font-weight: bold;">class</span> <span style="color: #bb0066; font-weight: bold;">Trie</span> <span style="color: #333333;">{</span>
<span style="color: #008800; font-weight: bold;">interface</span> <span style="color: #bb0066; font-weight: bold;">GetAndPut</span> <span style="color: #333333;">{</span>
<span style="color: #008800; font-weight: bold;">public</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">put</span><span style="color: #333333;">(</span>Character ch<span style="color: #333333;">,</span> Trie trie<span style="color: #333333;">);</span>
<span style="color: #008800; font-weight: bold;">public</span> Trie <span style="color: #0066bb; font-weight: bold;">get</span><span style="color: #333333;">(</span>Character ch<span style="color: #333333;">);</span>
<span style="color: #008800; font-weight: bold;">public</span> <span style="color: #333399; font-weight: bold;">int</span> <span style="color: #0066bb; font-weight: bold;">lastUsedIndex</span><span style="color: #333333;">();</span>
<span style="color: #008800; font-weight: bold;">public</span> Trie<span style="color: #333333;">[]</span> <span style="color: #0066bb; font-weight: bold;">children</span><span style="color: #333333;">();</span>
<span style="color: #008800; font-weight: bold;">public</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">trim</span><span style="color: #333333;">();</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">class</span> <span style="color: #bb0066; font-weight: bold;">SuffixCharsWithMap</span> <span style="color: #008800; font-weight: bold;">implements</span> GetAndPut <span style="color: #333333;">{</span>
Map<span style="color: #333333;"><</span>Character<span style="color: #333333;">,</span> Trie<span style="color: #333333;">></span> list <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">new</span> HashMap<span style="color: #333333;"><</span>Character<span style="color: #333333;">,</span> Trie<span style="color: #333333;">>();</span>
<span style="color: #008800; font-weight: bold;">public</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">put</span><span style="color: #333333;">(</span>Character ch<span style="color: #333333;">,</span> Trie node<span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
list<span style="color: #333333;">.</span><span style="color: #0000cc;">put</span><span style="color: #333333;">(</span>ch<span style="color: #333333;">,</span> node<span style="color: #333333;">);</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">public</span> Trie <span style="color: #0066bb; font-weight: bold;">get</span><span style="color: #333333;">(</span>Character ch<span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
<span style="color: #008800; font-weight: bold;">return</span> list<span style="color: #333333;">.</span><span style="color: #0000cc;">get</span><span style="color: #333333;">(</span>ch<span style="color: #333333;">);</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">public</span> <span style="color: #333399; font-weight: bold;">int</span> <span style="color: #0066bb; font-weight: bold;">lastUsedIndex</span><span style="color: #333333;">()</span> <span style="color: #333333;">{</span>
<span style="color: #008800; font-weight: bold;">return</span> list<span style="color: #333333;">.</span><span style="color: #0000cc;">size</span><span style="color: #333333;">()-</span><span style="color: #0000dd; font-weight: bold;">1</span><span style="color: #333333;">;</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">public</span> Trie<span style="color: #333333;">[]</span> <span style="color: #0066bb; font-weight: bold;">children</span><span style="color: #333333;">()</span> <span style="color: #333333;">{</span>
<span style="color: #008800; font-weight: bold;">return</span> list<span style="color: #333333;">.</span><span style="color: #0000cc;">values</span><span style="color: #333333;">().</span><span style="color: #0000cc;">toArray</span><span style="color: #333333;">(</span><span style="color: #008800; font-weight: bold;">new</span> Trie<span style="color: #333333;">[</span>list<span style="color: #333333;">.</span><span style="color: #0000cc;">size</span><span style="color: #333333;">()]);</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">public</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">trim</span><span style="color: #333333;">()</span> <span style="color: #333333;">{</span>
<span style="color: #333333;">}</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">class</span> <span style="color: #bb0066; font-weight: bold;">SuffixCharsWithArray</span> <span style="color: #008800; font-weight: bold;">implements</span> GetAndPut <span style="color: #333333;">{</span>
<span style="color: #008800; font-weight: bold;">public</span> Trie<span style="color: #333333;">[]</span> list<span style="color: #333333;">;</span>
<span style="color: #008800; font-weight: bold;">public</span> <span style="color: #0066bb; font-weight: bold;">SuffixCharsWithArray</span><span style="color: #333333;">()</span> <span style="color: #333333;">{</span>
<span style="color: #333399; font-weight: bold;">int</span> sz <span style="color: #333333;">=</span> <span style="color: #333333;">(</span><span style="color: #333399; font-weight: bold;">int</span><span style="color: #333333;">)(</span><span style="color: #0044dd;">'z'</span><span style="color: #333333;">)</span> <span style="color: #333333;">-</span> <span style="color: #333333;">(</span><span style="color: #333399; font-weight: bold;">int</span><span style="color: #333333;">)</span><span style="color: #0044dd;">'`'</span> <span style="color: #333333;">+</span><span style="color: #0000dd; font-weight: bold;">1</span><span style="color: #333333;">;</span>
list <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">new</span> Trie<span style="color: #333333;">[</span>sz<span style="color: #333333;">];</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">public</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">put</span><span style="color: #333333;">(</span>Character ch<span style="color: #333333;">,</span> Trie node<span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
list<span style="color: #333333;">[(</span><span style="color: #333399; font-weight: bold;">int</span><span style="color: #333333;">)</span>ch <span style="color: #333333;">-</span> <span style="color: #333333;">(</span><span style="color: #333399; font-weight: bold;">int</span><span style="color: #333333;">)</span><span style="color: #0044dd;">'`'</span><span style="color: #333333;">]</span> <span style="color: #333333;">=</span> node<span style="color: #333333;">;</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">public</span> Trie <span style="color: #0066bb; font-weight: bold;">get</span><span style="color: #333333;">(</span>Character ch<span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
<span style="color: #008800; font-weight: bold;">try</span> <span style="color: #333333;">{</span>
<span style="color: #008800; font-weight: bold;">return</span> list<span style="color: #333333;">[(</span><span style="color: #333399; font-weight: bold;">int</span><span style="color: #333333;">)</span> ch <span style="color: #333333;">-</span> <span style="color: #333333;">(</span><span style="color: #333399; font-weight: bold;">int</span><span style="color: #333333;">)</span> <span style="color: #0044dd;">'`'</span><span style="color: #333333;">];</span>
<span style="color: #333333;">}</span> <span style="color: #008800; font-weight: bold;">catch</span> <span style="color: #333333;">(</span>ArrayIndexOutOfBoundsException e<span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
<span style="color: #888888;">// we hit an index that got trimmed out</span>
<span style="color: #008800; font-weight: bold;">return</span> <span style="color: #008800; font-weight: bold;">null</span><span style="color: #333333;">;</span>
<span style="color: #333333;">}</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">public</span> <span style="color: #333399; font-weight: bold;">int</span> <span style="color: #0066bb; font-weight: bold;">lastUsedIndex</span><span style="color: #333333;">()</span> <span style="color: #333333;">{</span>
<span style="color: #333399; font-weight: bold;">int</span> lue <span style="color: #333333;">=</span> <span style="color: #333333;">-</span><span style="color: #0000dd; font-weight: bold;">1</span><span style="color: #333333;">;</span>
<span style="color: #008800; font-weight: bold;">for</span> <span style="color: #333333;">(</span><span style="color: #333399; font-weight: bold;">int</span> i<span style="color: #333333;">=</span><span style="color: #0000dd; font-weight: bold;">0</span><span style="color: #333333;">;</span> i<span style="color: #333333;"><</span>list<span style="color: #333333;">.</span><span style="color: #0000cc;">length</span><span style="color: #333333;">;</span> i<span style="color: #333333;">++)</span> <span style="color: #333333;">{</span>
<span style="color: #008800; font-weight: bold;">if</span> <span style="color: #333333;">(</span>list<span style="color: #333333;">[</span>i<span style="color: #333333;">]</span> <span style="color: #333333;">!=</span> <span style="color: #008800; font-weight: bold;">null</span><span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
lue <span style="color: #333333;">=</span> i<span style="color: #333333;">;</span>
<span style="color: #333333;">}</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">return</span> lue<span style="color: #333333;">;</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">public</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">trim</span><span style="color: #333333;">()</span> <span style="color: #333333;">{</span>
<span style="color: #008800; font-weight: bold;">if</span> <span style="color: #333333;">(</span>lastUsedIndex<span style="color: #333333;">()+</span><span style="color: #0000dd; font-weight: bold;">1</span> <span style="color: #333333;"><</span> list<span style="color: #333333;">.</span><span style="color: #0000cc;">length</span><span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
Trie<span style="color: #333333;">[]</span> newList <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">new</span> Trie<span style="color: #333333;">[</span>lastUsedIndex<span style="color: #333333;">()</span> <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">1</span><span style="color: #333333;">];</span>
<span style="color: #008800; font-weight: bold;">for</span> <span style="color: #333333;">(</span><span style="color: #333399; font-weight: bold;">int</span> i <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span><span style="color: #333333;">;</span> i <span style="color: #333333;"><</span> newList<span style="color: #333333;">.</span><span style="color: #0000cc;">length</span><span style="color: #333333;">;</span> i<span style="color: #333333;">++)</span> <span style="color: #333333;">{</span>
newList<span style="color: #333333;">[</span>i<span style="color: #333333;">]</span> <span style="color: #333333;">=</span> list<span style="color: #333333;">[</span>i<span style="color: #333333;">];</span>
<span style="color: #333333;">}</span>
list <span style="color: #333333;">=</span> newList<span style="color: #333333;">;</span>
<span style="color: #333333;">}</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">public</span> Trie<span style="color: #333333;">[]</span> <span style="color: #0066bb; font-weight: bold;">children</span><span style="color: #333333;">()</span> <span style="color: #333333;">{</span>
List<span style="color: #333333;"><</span>Trie<span style="color: #333333;">></span> l <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">new</span> ArrayList<span style="color: #333333;"><</span>Trie<span style="color: #333333;">>();</span>
<span style="color: #008800; font-weight: bold;">for</span> <span style="color: #333333;">(</span>Trie <span style="color: #997700; font-weight: bold;">t:</span> list<span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
<span style="color: #008800; font-weight: bold;">if</span> <span style="color: #333333;">(</span>t <span style="color: #333333;">!=</span> <span style="color: #008800; font-weight: bold;">null</span> <span style="color: #333333;">&&</span> t<span style="color: #333333;">.</span><span style="color: #0000cc;">getChar</span><span style="color: #333333;">()</span> <span style="color: #333333;">!=</span> <span style="color: #0044dd;">'`'</span><span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
l<span style="color: #333333;">.</span><span style="color: #0000cc;">add</span><span style="color: #333333;">(</span>t<span style="color: #333333;">);</span>
<span style="color: #333333;">}</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">return</span> l<span style="color: #333333;">.</span><span style="color: #0000cc;">toArray</span><span style="color: #333333;">(</span><span style="color: #008800; font-weight: bold;">new</span> Trie<span style="color: #333333;">[</span>l<span style="color: #333333;">.</span><span style="color: #0000cc;">size</span><span style="color: #333333;">()]);</span>
<span style="color: #333333;">}</span>
<span style="color: #333333;">}</span>
<span style="color: #888888;">//store char and the prefix count using 32 bits</span>
<span style="color: #888888;">//the first byte is the character, the next 3 bytes get the prefix count</span>
<span style="color: #888888;">//3 bytes can hold ~ 16 million, and there aren't that many english</span>
<span style="color: #888888;">//words. the total word count is less than 250,000, and the prefix count</span>
<span style="color: #888888;">//of any substring is less than that.</span>
<span style="color: #008800; font-weight: bold;">private</span> <span style="color: #333399; font-weight: bold;">int</span> count <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span><span style="color: #333333;">;</span>
<span style="color: #008800; font-weight: bold;">public</span> <span style="color: #333399; font-weight: bold;">char</span> <span style="color: #0066bb; font-weight: bold;">getChar</span><span style="color: #333333;">()</span> <span style="color: #333333;">{</span>
<span style="color: #008800; font-weight: bold;">return</span> <span style="color: #333333;">(</span><span style="color: #333399; font-weight: bold;">char</span><span style="color: #333333;">)(</span>count <span style="color: #333333;">&</span> <span style="color: #005588; font-weight: bold;">0xFF000000</span> <span style="color: #333333;">>></span> <span style="color: #0000dd; font-weight: bold;">24</span><span style="color: #333333;">);</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">public</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">setChar</span><span style="color: #333333;">(</span><span style="color: #333399; font-weight: bold;">char</span> ch<span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
count <span style="color: #333333;">=</span> <span style="color: #333333;">((</span><span style="color: #333399; font-weight: bold;">int</span><span style="color: #333333;">)</span>ch<span style="color: #333333;">)</span> <span style="color: #333333;"><<</span> <span style="color: #0000dd; font-weight: bold;">24</span> <span style="color: #333333;">|</span> <span style="color: #333333;">(</span>count <span style="color: #333333;">&</span> <span style="color: #005588; font-weight: bold;">0x00FFFFFF</span><span style="color: #333333;">);</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">public</span> <span style="color: #333399; font-weight: bold;">int</span> <span style="color: #0066bb; font-weight: bold;">getCount</span><span style="color: #333333;">()</span> <span style="color: #333333;">{</span>
<span style="color: #008800; font-weight: bold;">return</span> count <span style="color: #333333;">&</span> <span style="color: #005588; font-weight: bold;">0x00FFFFFF</span><span style="color: #333333;">;</span>
<span style="color: #333333;">}</span>
<span style="color: #888888;">//this is safe as the count will never get high enough</span>
<span style="color: #888888;">//to push over int the most significant byte holding the character</span>
<span style="color: #008800; font-weight: bold;">public</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">incCount</span><span style="color: #333333;">()</span> <span style="color: #333333;">{</span>
count<span style="color: #333333;">++;</span>
<span style="color: #333333;">}</span>
GetAndPut suffixChars <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">new</span> SuffixCharsWithArray<span style="color: #333333;">();</span>
<span style="color: #888888;">//GetAndPut suffixChars = new SuffixCharsWithMap();</span>
<span style="color: #008800; font-weight: bold;">public</span> <span style="color: #0066bb; font-weight: bold;">Trie</span><span style="color: #333333;">(</span><span style="color: #333399; font-weight: bold;">char</span> ch<span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
<span style="color: #008800; font-weight: bold;">this</span><span style="color: #333333;">.</span><span style="color: #0000cc;">setChar</span><span style="color: #333333;">(</span>ch<span style="color: #333333;">);</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">public</span> Trie <span style="color: #0066bb; font-weight: bold;">add</span><span style="color: #333333;">(</span><span style="color: #333399; font-weight: bold;">char</span> ch<span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
Trie node <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">this</span><span style="color: #333333;">.</span><span style="color: #0000cc;">suffixChars</span><span style="color: #333333;">.</span><span style="color: #0000cc;">get</span><span style="color: #333333;">(</span>ch<span style="color: #333333;">);</span>
<span style="color: #008800; font-weight: bold;">if</span> <span style="color: #333333;">(</span>node <span style="color: #333333;">==</span> <span style="color: #008800; font-weight: bold;">null</span><span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
Trie newNode <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">new</span> Trie<span style="color: #333333;">(</span>ch<span style="color: #333333;">);</span>
<span style="color: #008800; font-weight: bold;">this</span><span style="color: #333333;">.</span><span style="color: #0000cc;">suffixChars</span><span style="color: #333333;">.</span><span style="color: #0000cc;">put</span><span style="color: #333333;">(</span>ch<span style="color: #333333;">,</span> newNode<span style="color: #333333;">);</span>
node <span style="color: #333333;">=</span> newNode<span style="color: #333333;">;</span>
<span style="color: #333333;">}</span>
<span style="color: #888888;">//adding the count to the current node is preferable</span>
<span style="color: #888888;">//to adding to the node that matches the character.</span>
<span style="color: #888888;">//This way, we won't add to the sentinel node</span>
<span style="color: #888888;">//and we add only in one place.</span>
<span style="color: #008800; font-weight: bold;">this</span><span style="color: #333333;">.</span><span style="color: #0000cc;">incCount</span><span style="color: #333333;">();</span>
<span style="color: #008800; font-weight: bold;">return</span> node<span style="color: #333333;">;</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">public</span> <span style="color: #333399; font-weight: bold;">int</span> <span style="color: #0066bb; font-weight: bold;">size</span><span style="color: #333333;">()</span> <span style="color: #333333;">{</span>
<span style="color: #008800; font-weight: bold;">return</span> <span style="color: #008800; font-weight: bold;">this</span><span style="color: #333333;">.</span><span style="color: #0000cc;">count</span><span style="color: #333333;">;</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">private</span> Trie <span style="color: #0066bb; font-weight: bold;">findChar</span><span style="color: #333333;">(</span><span style="color: #333399; font-weight: bold;">char</span> ch<span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
<span style="color: #008800; font-weight: bold;">return</span> <span style="color: #008800; font-weight: bold;">this</span><span style="color: #333333;">.</span><span style="color: #0000cc;">suffixChars</span><span style="color: #333333;">.</span><span style="color: #0000cc;">get</span><span style="color: #333333;">(</span>ch<span style="color: #333333;">);</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">public</span> <span style="color: #333399; font-weight: bold;">boolean</span> <span style="color: #0066bb; font-weight: bold;">findWord</span><span style="color: #333333;">(</span>String word<span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
Trie node <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">this</span><span style="color: #333333;">;</span>
<span style="color: #008800; font-weight: bold;">for</span> <span style="color: #333333;">(</span><span style="color: #333399; font-weight: bold;">char</span> <span style="color: #997700; font-weight: bold;">ch:</span> word<span style="color: #333333;">.</span><span style="color: #0000cc;">toCharArray</span><span style="color: #333333;">())</span> <span style="color: #333333;">{</span>
node <span style="color: #333333;">=</span> node<span style="color: #333333;">.</span><span style="color: #0000cc;">findChar</span><span style="color: #333333;">(</span>ch<span style="color: #333333;">);</span>
<span style="color: #008800; font-weight: bold;">if</span> <span style="color: #333333;">(</span>node <span style="color: #333333;">==</span> <span style="color: #008800; font-weight: bold;">null</span><span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
<span style="color: #008800; font-weight: bold;">return</span> <span style="color: #008800; font-weight: bold;">false</span><span style="color: #333333;">;</span>
<span style="color: #333333;">}</span>
<span style="color: #333333;">}</span>
<span style="color: #888888;">//we may have found a prefix, make sure it is a word</span>
<span style="color: #888888;">//if it's a word, the list must have the sentinel.</span>
<span style="color: #008800; font-weight: bold;">return</span> node<span style="color: #333333;">.</span><span style="color: #0000cc;">suffixChars</span><span style="color: #333333;">.</span><span style="color: #0000cc;">get</span><span style="color: #333333;">(</span><span style="color: #0044dd;">'`'</span><span style="color: #333333;">)</span> <span style="color: #333333;">!=</span> <span style="color: #008800; font-weight: bold;">null</span><span style="color: #333333;">;</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">public</span> <span style="color: #333399; font-weight: bold;">int</span> <span style="color: #0066bb; font-weight: bold;">findPartial</span><span style="color: #333333;">(</span>String prefix<span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
Trie node <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">this</span><span style="color: #333333;">;</span>
<span style="color: #008800; font-weight: bold;">for</span> <span style="color: #333333;">(</span><span style="color: #333399; font-weight: bold;">char</span> ch <span style="color: #333333;">:</span> prefix<span style="color: #333333;">.</span><span style="color: #0000cc;">toCharArray</span><span style="color: #333333;">())</span> <span style="color: #333333;">{</span>
node <span style="color: #333333;">=</span> node<span style="color: #333333;">.</span><span style="color: #0000cc;">suffixChars</span><span style="color: #333333;">.</span><span style="color: #0000cc;">get</span><span style="color: #333333;">(</span>ch<span style="color: #333333;">);</span>
<span style="color: #008800; font-weight: bold;">if</span> <span style="color: #333333;">(</span>node <span style="color: #333333;">==</span> <span style="color: #008800; font-weight: bold;">null</span><span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
<span style="color: #008800; font-weight: bold;">return</span> <span style="color: #0000dd; font-weight: bold;">0</span><span style="color: #333333;">;</span>
<span style="color: #333333;">}</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">return</span> node<span style="color: #333333;">.</span><span style="color: #0000cc;">size</span><span style="color: #333333;">();</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">public</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">add</span><span style="color: #333333;">(</span>String s<span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
Trie node <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">this</span><span style="color: #333333;">;</span>
<span style="color: #008800; font-weight: bold;">for</span> <span style="color: #333333;">(</span><span style="color: #333399; font-weight: bold;">char</span> ch <span style="color: #333333;">:</span> s<span style="color: #333333;">.</span><span style="color: #0000cc;">toCharArray</span><span style="color: #333333;">())</span> <span style="color: #333333;">{</span>
node <span style="color: #333333;">=</span> node<span style="color: #333333;">.</span><span style="color: #0000cc;">add</span><span style="color: #333333;">(</span>ch<span style="color: #333333;">);</span>
<span style="color: #333333;">}</span>
<span style="color: #888888;">//add the sentinel to mark the end of the word.</span>
node<span style="color: #333333;">.</span><span style="color: #0000cc;">add</span><span style="color: #333333;">(</span><span style="color: #0044dd;">'`'</span><span style="color: #333333;">);</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">private</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">walk</span><span style="color: #333333;">(</span><span style="color: #333399; font-weight: bold;">int</span><span style="color: #333333;">[]</span> indices<span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
indices<span style="color: #333333;">[</span><span style="color: #008800; font-weight: bold;">this</span><span style="color: #333333;">.</span><span style="color: #0000cc;">suffixChars</span><span style="color: #333333;">.</span><span style="color: #0000cc;">lastUsedIndex</span><span style="color: #333333;">()]</span> <span style="color: #333333;">++;</span>
<span style="color: #008800; font-weight: bold;">for</span> <span style="color: #333333;">(</span>Trie ch <span style="color: #333333;">:</span> suffixChars<span style="color: #333333;">.</span><span style="color: #0000cc;">children</span><span style="color: #333333;">())</span> <span style="color: #333333;">{</span>
ch<span style="color: #333333;">.</span><span style="color: #0000cc;">walk</span><span style="color: #333333;">(</span>indices<span style="color: #333333;">);</span>
<span style="color: #333333;">}</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">public</span> <span style="color: #333399; font-weight: bold;">int</span><span style="color: #333333;">[]</span> <span style="color: #0066bb; font-weight: bold;">lastUsedIndices</span><span style="color: #333333;">()</span> <span style="color: #333333;">{</span>
<span style="color: #333399; font-weight: bold;">int</span><span style="color: #333333;">[]</span> indices <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">new</span> <span style="color: #333399; font-weight: bold;">int</span><span style="color: #333333;">[(</span><span style="color: #333399; font-weight: bold;">int</span><span style="color: #333333;">)</span><span style="color: #0044dd;">'z'</span> <span style="color: #333333;">-</span> <span style="color: #333333;">(</span><span style="color: #333399; font-weight: bold;">int</span><span style="color: #333333;">)</span><span style="color: #0044dd;">'`'</span> <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">1</span><span style="color: #333333;">];</span>
walk<span style="color: #333333;">(</span>indices<span style="color: #333333;">);</span>
<span style="color: #008800; font-weight: bold;">return</span> indices<span style="color: #333333;">;</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">private</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">walk2</span><span style="color: #333333;">()</span> <span style="color: #333333;">{</span>
suffixChars<span style="color: #333333;">.</span><span style="color: #0000cc;">trim</span><span style="color: #333333;">();</span>
<span style="color: #008800; font-weight: bold;">for</span> <span style="color: #333333;">(</span>Trie ch <span style="color: #333333;">:</span> suffixChars<span style="color: #333333;">.</span><span style="color: #0000cc;">children</span><span style="color: #333333;">())</span> <span style="color: #333333;">{</span>
ch<span style="color: #333333;">.</span><span style="color: #0000cc;">walk2</span><span style="color: #333333;">();</span>
<span style="color: #333333;">}</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">private</span> <span style="color: #333399; font-weight: bold;">int</span> <span style="color: #0066bb; font-weight: bold;">walk3</span><span style="color: #333333;">(</span>Trie t<span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
<span style="color: #008800; font-weight: bold;">if</span> <span style="color: #333333;">(</span>t <span style="color: #333333;">==</span> <span style="color: #008800; font-weight: bold;">null</span><span style="color: #333333;">)</span> <span style="color: #008800; font-weight: bold;">return</span> <span style="color: #0000dd; font-weight: bold;">0</span><span style="color: #333333;">;</span>
<span style="color: #888888;">// 4 = size of `count`</span>
<span style="color: #888888;">// 8 = size of each reference to a Trie</span>
<span style="color: #333399; font-weight: bold;">int</span> acc <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">4</span> <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">8</span> <span style="color: #333333;">*</span> <span style="color: #333333;">(((</span>SuffixCharsWithArray<span style="color: #333333;">)</span>t<span style="color: #333333;">.</span><span style="color: #0000cc;">suffixChars</span><span style="color: #333333;">).</span><span style="color: #0000cc;">list</span><span style="color: #333333;">.</span><span style="color: #0000cc;">length</span><span style="color: #333333;">);</span>
<span style="color: #008800; font-weight: bold;">for</span> <span style="color: #333333;">(</span>Trie <span style="color: #997700; font-weight: bold;">node:</span> t<span style="color: #333333;">.</span><span style="color: #0000cc;">suffixChars</span><span style="color: #333333;">.</span><span style="color: #0000cc;">children</span><span style="color: #333333;">())</span> <span style="color: #333333;">{</span>
acc <span style="color: #333333;">+=</span> walk3<span style="color: #333333;">(</span>node<span style="color: #333333;">);</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">return</span> acc<span style="color: #333333;">;</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">public</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">trim</span><span style="color: #333333;">()</span> <span style="color: #333333;">{</span>
walk2<span style="color: #333333;">();</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">public</span> <span style="color: #333399; font-weight: bold;">int</span> <span style="color: #0066bb; font-weight: bold;">sizeInBytes</span><span style="color: #333333;">()</span> <span style="color: #333333;">{</span>
<span style="color: #008800; font-weight: bold;">return</span> <span style="color: #0066bb; font-weight: bold;">walk3</span><span style="color: #333333;">(</span><span style="color: #008800; font-weight: bold;">this</span><span style="color: #333333;">);</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">public</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">read</span><span style="color: #333333;">()</span> <span style="color: #008800; font-weight: bold;">throws</span> FileNotFoundException <span style="color: #333333;">{</span>
String wordFilePath <span style="color: #333333;">=</span> <span style="background-color: #fff0f0;">"/Users/thushara/lcwords.txt"</span><span style="color: #333333;">;</span>
BufferedReader br <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">new</span> BufferedReader<span style="color: #333333;">(</span><span style="color: #008800; font-weight: bold;">new</span> FileReader<span style="color: #333333;">(</span>wordFilePath<span style="color: #333333;">));</span>
String word<span style="color: #333333;">;</span>
<span style="color: #008800; font-weight: bold;">try</span> <span style="color: #333333;">{</span>
<span style="color: #008800; font-weight: bold;">while</span> <span style="color: #333333;">((</span>word <span style="color: #333333;">=</span> br<span style="color: #333333;">.</span><span style="color: #0000cc;">readLine</span><span style="color: #333333;">())</span> <span style="color: #333333;">!=</span> <span style="color: #008800; font-weight: bold;">null</span><span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
add<span style="color: #333333;">(</span>word<span style="color: #333333;">);</span>
<span style="color: #333333;">}</span>
<span style="color: #333333;">}</span> <span style="color: #008800; font-weight: bold;">catch</span> <span style="color: #333333;">(</span>IOException e<span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
System<span style="color: #333333;">.</span><span style="color: #0000cc;">err</span><span style="color: #333333;">.</span><span style="color: #0000cc;">format</span><span style="color: #333333;">(</span><span style="background-color: #fff0f0;">"disk error! %s"</span><span style="color: #333333;">,</span> e<span style="color: #333333;">.</span><span style="color: #0000cc;">getMessage</span><span style="color: #333333;">());</span>
<span style="color: #333333;">}</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">public</span> String <span style="color: #0066bb; font-weight: bold;">getRandomWordList</span><span style="color: #333333;">()</span> <span style="color: #008800; font-weight: bold;">throws</span> MalformedURLException<span style="color: #333333;">,</span> IOException <span style="color: #333333;">{</span>
Pattern alpha <span style="color: #333333;">=</span> Pattern<span style="color: #333333;">.</span><span style="color: #0000cc;">compile</span><span style="color: #333333;">(</span><span style="background-color: #fff0f0;">"^[A-Za-z]+$"</span><span style="color: #333333;">);</span>
String url <span style="color: #333333;">=</span> <span style="background-color: #fff0f0;">"https://pastebin.com/raw/NXH7UAr1"</span><span style="color: #333333;">;</span>
URL obj <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">new</span> URL<span style="color: #333333;">(</span>url<span style="color: #333333;">);</span>
HttpURLConnection con <span style="color: #333333;">=</span> <span style="color: #333333;">(</span>HttpURLConnection<span style="color: #333333;">)</span> obj<span style="color: #333333;">.</span><span style="color: #0000cc;">openConnection</span><span style="color: #333333;">();</span>
con<span style="color: #333333;">.</span><span style="color: #0000cc;">setRequestMethod</span><span style="color: #333333;">(</span><span style="background-color: #fff0f0;">"GET"</span><span style="color: #333333;">);</span>
<span style="color: #333399; font-weight: bold;">int</span> responseCode <span style="color: #333333;">=</span> con<span style="color: #333333;">.</span><span style="color: #0000cc;">getResponseCode</span><span style="color: #333333;">();</span>
BufferedReader in <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">new</span> BufferedReader<span style="color: #333333;">(</span>
<span style="color: #008800; font-weight: bold;">new</span> <span style="color: #0066bb; font-weight: bold;">InputStreamReader</span><span style="color: #333333;">(</span>con<span style="color: #333333;">.</span><span style="color: #0000cc;">getInputStream</span><span style="color: #333333;">()));</span>
String inputLine<span style="color: #333333;">;</span>
StringBuffer response <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">new</span> StringBuffer<span style="color: #333333;">();</span>
<span style="color: #008800; font-weight: bold;">while</span> <span style="color: #333333;">((</span>inputLine <span style="color: #333333;">=</span> in<span style="color: #333333;">.</span><span style="color: #0000cc;">readLine</span><span style="color: #333333;">())</span> <span style="color: #333333;">!=</span> <span style="color: #008800; font-weight: bold;">null</span><span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
Matcher m <span style="color: #333333;">=</span> alpha<span style="color: #333333;">.</span><span style="color: #0000cc;">matcher</span><span style="color: #333333;">(</span>inputLine<span style="color: #333333;">);</span>
<span style="color: #008800; font-weight: bold;">if</span> <span style="color: #333333;">(</span>m<span style="color: #333333;">.</span><span style="color: #0000cc;">matches</span><span style="color: #333333;">())</span> <span style="color: #333333;">{</span>
response<span style="color: #333333;">.</span><span style="color: #0000cc;">append</span><span style="color: #333333;">(</span>inputLine<span style="color: #333333;">.</span><span style="color: #0000cc;">toLowerCase</span><span style="color: #333333;">());</span>
<span style="color: #333333;">}</span>
<span style="color: #333333;">}</span>
in<span style="color: #333333;">.</span><span style="color: #0000cc;">close</span><span style="color: #333333;">();</span>
<span style="color: #008800; font-weight: bold;">return</span> response<span style="color: #333333;">.</span><span style="color: #0000cc;">toString</span><span style="color: #333333;">();</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">public</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">main</span><span style="color: #333333;">(</span>String<span style="color: #333333;">[]</span> args<span style="color: #333333;">)</span> <span style="color: #008800; font-weight: bold;">throws</span> FileNotFoundException<span style="color: #333333;">,</span> IOException <span style="color: #333333;">{</span>
<span style="color: #333399; font-weight: bold;">long</span> mem1 <span style="color: #333333;">=</span> Runtime<span style="color: #333333;">.</span><span style="color: #0000cc;">getRuntime</span><span style="color: #333333;">().</span><span style="color: #0000cc;">totalMemory</span><span style="color: #333333;">()</span> <span style="color: #333333;">-</span> Runtime<span style="color: #333333;">.</span><span style="color: #0000cc;">getRuntime</span><span style="color: #333333;">().</span><span style="color: #0000cc;">freeMemory</span><span style="color: #333333;">();</span>
System<span style="color: #333333;">.</span><span style="color: #0000cc;">out</span><span style="color: #333333;">.</span><span style="color: #0000cc;">format</span><span style="color: #333333;">(</span><span style="background-color: #fff0f0;">"memory usage at start %d\n"</span><span style="color: #333333;">,</span> mem1<span style="color: #333333;">);</span>
Trie trie <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">new</span> Trie<span style="color: #333333;">(</span><span style="color: #0044dd;">'$'</span><span style="color: #333333;">);</span>
trie<span style="color: #333333;">.</span><span style="color: #0000cc;">read</span><span style="color: #333333;">();</span>
System<span style="color: #333333;">.</span><span style="color: #0000cc;">out</span><span style="color: #333333;">.</span><span style="color: #0000cc;">format</span><span style="color: #333333;">(</span><span style="background-color: #fff0f0;">"size of trie in bytes: %d\n"</span><span style="color: #333333;">,</span> trie<span style="color: #333333;">.</span><span style="color: #0000cc;">sizeInBytes</span><span style="color: #333333;">());</span>
trie<span style="color: #333333;">.</span><span style="color: #0000cc;">trim</span><span style="color: #333333;">();</span>
System<span style="color: #333333;">.</span><span style="color: #0000cc;">out</span><span style="color: #333333;">.</span><span style="color: #0000cc;">format</span><span style="color: #333333;">(</span><span style="background-color: #fff0f0;">"size of trimmed trie in bytes: %d\n"</span><span style="color: #333333;">,</span> trie<span style="color: #333333;">.</span><span style="color: #0000cc;">sizeInBytes</span><span style="color: #333333;">());</span>
Scanner in <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">new</span> Scanner<span style="color: #333333;">(</span>System<span style="color: #333333;">.</span><span style="color: #0000cc;">in</span><span style="color: #333333;">);</span>
System<span style="color: #333333;">.</span><span style="color: #0000cc;">out</span><span style="color: #333333;">.</span><span style="color: #0000cc;">println</span><span style="color: #333333;">(</span><span style="background-color: #fff0f0;">"type a word in lower case (upper case char to exit)> "</span><span style="color: #333333;">);</span>
<span style="color: #008800; font-weight: bold;">while</span> <span style="color: #333333;">(</span><span style="color: #008800; font-weight: bold;">true</span><span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
String s <span style="color: #333333;">=</span> in<span style="color: #333333;">.</span><span style="color: #0000cc;">next</span><span style="color: #333333;">();</span>
<span style="color: #008800; font-weight: bold;">if</span> <span style="color: #333333;">(</span>Character<span style="color: #333333;">.</span><span style="color: #0000cc;">isUpperCase</span><span style="color: #333333;">(</span>s<span style="color: #333333;">.</span><span style="color: #0000cc;">charAt</span><span style="color: #333333;">(</span><span style="color: #0000dd; font-weight: bold;">0</span><span style="color: #333333;">)))</span> <span style="color: #008800; font-weight: bold;">break</span><span style="color: #333333;">;</span>
<span style="color: #333399; font-weight: bold;">boolean</span> found <span style="color: #333333;">=</span> trie<span style="color: #333333;">.</span><span style="color: #0000cc;">findPartial</span><span style="color: #333333;">(</span>s<span style="color: #333333;">)</span> <span style="color: #333333;">></span> <span style="color: #0000dd; font-weight: bold;">0</span><span style="color: #333333;">;</span>
System<span style="color: #333333;">.</span><span style="color: #0000cc;">out</span><span style="color: #333333;">.</span><span style="color: #0000cc;">println</span><span style="color: #333333;">(</span>found <span style="color: #333333;">?</span> <span style="background-color: #fff0f0;">"yes"</span> <span style="color: #333333;">:</span> <span style="background-color: #fff0f0;">"no"</span><span style="color: #333333;">);</span>
<span style="color: #333333;">}</span>
<span style="color: #333399; font-weight: bold;">long</span> st <span style="color: #333333;">=</span> System<span style="color: #333333;">.</span><span style="color: #0000cc;">currentTimeMillis</span><span style="color: #333333;">();</span>
String words <span style="color: #333333;">=</span> getRandomWordList<span style="color: #333333;">();</span>
String<span style="color: #333333;">[]</span> arr <span style="color: #333333;">=</span> words<span style="color: #333333;">.</span><span style="color: #0000cc;">split</span><span style="color: #333333;">(</span><span style="background-color: #fff0f0;">" "</span><span style="color: #333333;">);</span>
<span style="color: #008800; font-weight: bold;">for</span> <span style="color: #333333;">(</span>String <span style="color: #997700; font-weight: bold;">s:</span> arr<span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
<span style="color: #008800; font-weight: bold;">if</span> <span style="color: #333333;">(!</span>s<span style="color: #333333;">.</span><span style="color: #0000cc;">isEmpty</span><span style="color: #333333;">()</span> <span style="color: #333333;">&&</span> <span style="color: #333333;">!</span>trie<span style="color: #333333;">.</span><span style="color: #0000cc;">findWord</span><span style="color: #333333;">(</span>s<span style="color: #333333;">))</span> System<span style="color: #333333;">.</span><span style="color: #0000cc;">out</span><span style="color: #333333;">.</span><span style="color: #0000cc;">println</span><span style="color: #333333;">(</span><span style="background-color: #fff0f0;">"couldn't find "</span> <span style="color: #333333;">+</span> s<span style="color: #333333;">);</span>
<span style="color: #333333;">}</span>
<span style="color: #333399; font-weight: bold;">long</span> elapsed <span style="color: #333333;">=</span> System<span style="color: #333333;">.</span><span style="color: #0000cc;">currentTimeMillis</span><span style="color: #333333;">()</span> <span style="color: #333333;">-</span> st<span style="color: #333333;">;</span>
<span style="color: #333399; font-weight: bold;">long</span> mem2 <span style="color: #333333;">=</span> Runtime<span style="color: #333333;">.</span><span style="color: #0000cc;">getRuntime</span><span style="color: #333333;">().</span><span style="color: #0000cc;">totalMemory</span><span style="color: #333333;">()</span> <span style="color: #333333;">-</span> Runtime<span style="color: #333333;">.</span><span style="color: #0000cc;">getRuntime</span><span style="color: #333333;">().</span><span style="color: #0000cc;">freeMemory</span><span style="color: #333333;">();</span>
System<span style="color: #333333;">.</span><span style="color: #0000cc;">out</span><span style="color: #333333;">.</span><span style="color: #0000cc;">format</span><span style="color: #333333;">(</span><span style="background-color: #fff0f0;">"memory usage at end %d\n"</span><span style="color: #333333;">,</span> mem2<span style="color: #333333;">);</span>
System<span style="color: #333333;">.</span><span style="color: #0000cc;">out</span><span style="color: #333333;">.</span><span style="color: #0000cc;">format</span><span style="color: #333333;">(</span><span style="background-color: #fff0f0;">"took %d ms for %d words using %d MB\n"</span><span style="color: #333333;">,</span> elapsed<span style="color: #333333;">,</span> arr<span style="color: #333333;">.</span><span style="color: #0000cc;">length</span><span style="color: #333333;">,</span> <span style="color: #333333;">(</span>mem2 <span style="color: #333333;">-</span> mem1<span style="color: #333333;">)/</span><span style="color: #0000dd; font-weight: bold;">1024</span><span style="color: #333333;">/</span><span style="color: #0000dd; font-weight: bold;">1024</span><span style="color: #333333;">);</span>
<span style="color: #333333;">}</span>
<span style="color: #333333;">}</span>
</pre>
</td></tr>
</tbody></table>
</div>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-11265228.post-7198321259918200002017-08-12T20:09:00.000-07:002017-08-12T20:09:45.317-07:00folding with python<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-e0cFWQSd4B4/WYjo2egPjjI/AAAAAAAABNI/h5fXdwA5EUgmjrGGTxleIj1Opf8WrfAzgCPcBGAYYCw/s1600/pylogo.jpeg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="116" data-original-width="116" src="https://3.bp.blogspot.com/-e0cFWQSd4B4/WYjo2egPjjI/AAAAAAAABNI/h5fXdwA5EUgmjrGGTxleIj1Opf8WrfAzgCPcBGAYYCw/s1600/pylogo.jpeg" /></a></div>
<span style="font-family: "verdana" , "arial" , "helvetica" , sans-serif; font-size: x-small;">Solving this problem the functional way =></span><br />
<span style="font-family: "verdana" , "arial" , "helvetica" , sans-serif; font-size: x-small;"><br /></span>
<span style="font-family: "verdana" , "arial" , "helvetica" , sans-serif; font-size: x-small;">Find if a sorted list of positive numbers has duplicates.</span><br />
<span style="font-family: "verdana" , "arial" , "helvetica" , sans-serif; font-size: x-small;"><br /></span>
<style type="text/css">
p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Menlo}
span.s1 {font-variant-ligatures: no-common-ligatures}
</style>
<br />
<div class="p1">
<span class="s1">>>> def has_dups(nums):</span></div>
<div class="p1">
<span class="s1">... return reduce (lambda x,y: ( x[0] or (y == x[1]), y), nums, (False,0))[0]</span></div>
<div class="p1">
<span class="s1">... </span></div>
<div class="p1">
<span class="s1">>>> has_dups([1])</span></div>
<div class="p1">
<span class="s1">False</span></div>
<div class="p1">
<span class="s1">>>> has_dups([1,1])</span></div>
<div class="p1">
<span class="s1">True</span></div>
<div class="p1">
<span class="s1">>>> has_dups([1,1,1])</span></div>
<div class="p1">
<span class="s1">True</span></div>
<div class="p1">
<span class="s1">>>> has_dups([1,2,4])</span></div>
<div class="p1">
<span class="s1">False</span></div>
<div class="p1">
<span class="s1">>>> has_dups([1,2,2])</span></div>
<div class="p1">
<span class="s1">True</span></div>
<div class="p1">
<span class="s1">>>> has_dups([1,2,2,3,4])</span></div>
<div class="p1">
<span class="s1">True</span></div>
<br />
<div class="p1">
<span class="s1">>>> </span></div>
<div>
<br /></div>
<div>
From the definition of reduce:<br />
<br />
<dt id="reduce" style="background-color: #fbe54e; font-family: sans-serif; font-size: 16px;"><code class="descname" style="background-color: transparent; font-size: 1.2em; font-weight: bold; padding: 0px 1px;">reduce</code><span class="sig-paren">(</span><em>function</em>, <em>iterable</em><span class="optional" style="font-size: 1.3em;">[</span>, <em>initializer</em><span class="optional" style="font-size: 1.3em;">]</span><span class="sig-paren">)</span><a class="headerlink" href="https://docs.python.org/2/library/functions.html#reduce" style="color: #c60f0f; font-size: 0.8em; padding: 0px 4px; text-decoration-line: none; visibility: hidden;" title="Permalink to this definition"></a></dt>
<dd style="background-color: white; font-family: sans-serif; font-size: 16px; line-height: 20.8px; margin-bottom: 10px; margin-left: 30px; margin-top: 3px; text-align: justify;"><div style="line-height: 20.8px;">
Apply <em>function</em> of two arguments cumulatively to the items of <em>iterable</em>, from left to right, so as to reduce the iterable to a single value. For example, <code class="docutils literal" style="background-color: #ecf0f3; font-size: 0.95em; padding: 0px 1px;"><span class="pre">reduce(lambda</span> <span class="pre">x,</span> <span class="pre">y:</span> <span class="pre">x+y,</span> <span class="pre">[1,</span> <span class="pre">2,</span> <span class="pre">3,</span> <span class="pre">4,</span> <span class="pre">5])</span></code> calculates <code class="docutils literal" style="background-color: #ecf0f3; font-size: 0.95em; padding: 0px 1px;"><span class="pre">((((1+2)+3)+4)+5)</span></code>. The left argument, <em>x</em>, is the accumulated value and the right argument, <em>y</em>, is the update value from the <em>iterable</em>. If the optional <em>initializer</em> is present, it is placed before the items of the iterable in the calculation, and serves as a default when the iterable is empty. If <em>initializer</em> is not given and <em>iterable</em> contains only one item, the first item is returned. Roughly equivalent to:</div>
<div style="line-height: 20.8px;">
<br /></div>
</dd></div>
<div>
To solve the problem, we need to check if two adjacent items have the same value. To do this the functional way, we walk through the list using the reduce operator. At each point in the list, the reduce operator applies the user supplied function to the previous output(x) and the current item(y) of the list.<br />
<br />
We need to remember if we see two adjacent items, and pass it as the output. But we also have to pass the current value so that at the next step, reduce can evaluate the given function. So we have a tuple as output from the function :<br />
<br />
(truth value whether we have seen two adjacent items, current item)<br />
<br />
We need to pass an initial value for the tuple (initializer). The truth value would be False initially, and we pass a zero as all elements of the list are positive. (False, 0)<br />
<br />
So within the lambda, we need to check if the current item is the same as the previous item (<span style="font-family: Menlo; font-size: 11px; font-variant-ligatures: no-common-ligatures;">y == x[1]</span>) but if we had already met this condition (<span style="font-family: Menlo; font-size: 11px; font-variant-ligatures: no-common-ligatures;">x[0]</span>), we need to pass this along.</div>
<div>
<br /></div>
<div>
One of the drawbacks of the fold is that there is no quick break from traversing the list once we find a duplicate. It is possible to raise an exception in lambda and force a termination that way, but I don't know of a clean way to terminate the walk of the complete list.<br />
<br /></div>
<style type="text/css">
p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Menlo}
span.s1 {font-variant-ligatures: no-common-ligatures}
</style><style type="text/css">
p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Menlo}
span.s1 {font-variant-ligatures: no-common-ligatures}
</style>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-11265228.post-85405131558666169642017-08-07T15:25:00.003-07:002017-08-07T15:29:45.837-07:00Python 2.7 scoping bug<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-e0cFWQSd4B4/WYjo2egPjjI/AAAAAAAABNE/tkIK5VnyiiI94o1nsw3NSetKESQd5Z3BQCLcBGAs/s1600/pylogo.jpeg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="116" data-original-width="116" src="https://4.bp.blogspot.com/-e0cFWQSd4B4/WYjo2egPjjI/AAAAAAAABNE/tkIK5VnyiiI94o1nsw3NSetKESQd5Z3BQCLcBGAs/s1600/pylogo.jpeg" /></a></div>
Here is a piece of code that does not work on Python 2.7:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> #!/usr/bin/python
def img_type(s):
return str(s)
print (img_type(50))
a = [img_dir for (img_dir, img_type) in [("a",1),("b",2)]]
print (img_type(20))
</code></pre>
<br />
On the second print, it raises a TypeError:<br />
<style type="text/css">
p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Menlo}
span.s1 {font-variant-ligatures: no-common-ligatures}
</style>
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: "andale mono" , "lucida console" , "monaco" , "fixed" , monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>TypeError: 'int' object is not callable
</code></pre>
<div class="p1">
<span class="s1"><br /></span></div>
The interpreter is incorrectly identifying the scoped variables img_dir, img_type to be in global scope. Since the function is of the same name, the variable takes precedence. Actually it overwrites the function!<br />
<div>
<br /></div>
<div>
We can see what is happening by looking at the globals().items() and locals().items(). Each is a list of tuples where each tuple contains the variable name and its currently assigned value. Here is a modified program that lists the variables, before and after we define the list comprehension:</div>
<div>
<br /></div>
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: "andale mono" , "lucida console" , "monaco" , "fixed" , monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>#!/usr/bin/python
def img_type(s):
return str(s)
print (img_type(50))
print ("BEFORE")
print (globals().items())
print (locals().items())
a = [img_dir for (img_dir, img_type) in [("a",1),("b",2)]]
print ("AFTER")
print (globals().items())
print (locals().items())
print (img_type(20))
</code></pre>
<br />
This outputs:
<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: "andale mono" , "lucida console" , "monaco" , "fixed" , monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>50
BEFORE
[('img_type', <function img_type at 0x7f740ad5eb18>), ('__builtins__', <module '__builtin__' (built-in)>), ('__file__', './proof1.py'), ('__package__', None), ('__name__', '__main__'), ('__doc__', None)]
[('img_type', <function img_type at 0x7f740ad5eb18>), ('__builtins__', <module '__builtin__' (built-in)>), ('__file__', './proof1.py'), ('__package__', None), ('__name__', '__main__'), ('__doc__', None)]
AFTER
[('img_type', 2), ('a', ['a', 'b']), ('__builtins__', <module '__builtin__' (built-in)>), ('img_dir', 'b'), ('__file__', './proof1.py'), ('__package__', None), ('__name__', '__main__'), ('__doc__', None)]
[('img_type', 2), ('a', ['a', 'b']), ('__builtins__', <module '__builtin__' (built-in)>), ('img_dir', 'b'), ('__file__', './proof1.py'), ('__package__', None), ('__name__', '__main__'), ('__doc__', None)]
Traceback (most recent call last):
File "./proof1.py", line 19, in <module>
print (img_type(20))
TypeError: 'int' object is not callable
</code></pre>
<br />
Notice how after the list comprehension the img_type() function got clobbered by the locally scoped variable by the same name.<br />
<br />
This is fixed as of Python 3.2. Here is the output running the second version of the program:<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: "andale mono" , "lucida console" , "monaco" , "fixed" , monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>50
BEFORE
dict_items([('__name__', '__main__'), ('__doc__', None), ('__loader__', <_frozen_importlib_external.SourceFileLoader object at 0x7ff93f4b7780>), ('__file__', './proof1.py'), ('__builtins__', <module 'builtins' (built-in)>), ('__spec__', None), ('img_type', <function img_type at 0x7ff93f41c2f0>), ('__package__', None), ('__cached__', None)])
dict_items([('__name__', '__main__'), ('__doc__', None), ('__loader__', <_frozen_importlib_external.SourceFileLoader object at 0x7ff93f4b7780>), ('__file__', './proof1.py'), ('__builtins__', <module 'builtins' (built-in)>), ('__spec__', None), ('img_type', <function img_type at 0x7ff93f41c2f0>), ('__package__', None), ('__cached__', None)])
AFTER
dict_items([('__name__', '__main__'), ('__doc__', None), ('a', ['a', 'b']), ('__loader__', <_frozen_importlib_external.SourceFileLoader object at 0x7ff93f4b7780>), ('__file__', './proof1.py'), ('__builtins__', <module 'builtins' (built-in)>), ('__spec__', None), ('img_type', <function img_type at 0x7ff93f41c2f0>), ('__package__', None), ('__cached__', None)])
dict_items([('__name__', '__main__'), ('__doc__', None), ('a', ['a', 'b']), ('__loader__', <_frozen_importlib_external.SourceFileLoader object at 0x7ff93f4b7780>), ('__file__', './proof1.py'), ('__builtins__', <module 'builtins' (built-in)>), ('__spec__', None), ('img_type', <function img_type at 0x7ff93f41c2f0>), ('__package__', None), ('__cached__', None)])
20
</code></pre>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-11265228.post-54925800439334041902017-03-13T13:52:00.001-07:002017-03-13T13:54:27.459-07:00Docker: delete all tags of image<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-AnwusEdgJDU/WMcG0Qf0UvI/AAAAAAAABIA/8rtw4X6FUYc_Pr1yuuIksBiG87i5zmHigCLcB/s1600/docker.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-AnwusEdgJDU/WMcG0Qf0UvI/AAAAAAAABIA/8rtw4X6FUYc_Pr1yuuIksBiG87i5zmHigCLcB/s1600/docker.png" /></a></div>
Handy one liner to delete all tags of a particular docker image (on Linux):<br />
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<table><tbody>
<tr><td><pre style="line-height: 125%; margin: 0;"></pre>
</td><td><pre style="line-height: 125%; margin: 0;">docker images | grep rabbitmq | tr -s <span style="background-color: #fff0f0;">' '</span> | cut -d <span style="background-color: #fff0f0;">' '</span> -f 2 | xargs -I <span style="color: #333333;">{}</span> docker rmi docker-staging-local.artifactory.corp.alleninstitute.org/rabbitmq:<span style="color: #333333;">{}</span> </pre>
</td></tr>
</tbody></table>
</div>
<br />
Here are the image tags that this one-liner removes:<br />
<br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<table><tbody>
<tr><td><pre style="line-height: 125%; margin: 0;">1
2
3
4
5
6
7</pre>
</td><td><pre style="line-height: 125%; margin: 0;">thusharaw@denali ois-rabbitmq <span style="color: #333333;">(</span>thushara/rabbitv1<span style="color: #333333;">)</span>*<span style="color: #996633;">$ </span>docker images | grep rabbitmq
docker-staging-local.artifactory.corp.alleninstitute.org/rabbitmq 2017031300 6fb1ed865ae6 3 hours ago 179 MB
docker-staging-local.artifactory.corp.alleninstitute.org/rabbitmq 2017031307 6fb1ed865ae6 3 hours ago 179 MB
docker-staging-local.artifactory.corp.alleninstitute.org/rabbitmq 2017031007 21b84ae0586e 2 days ago 179 MB
docker-staging-local.artifactory.corp.alleninstitute.org/rabbitmq 201703100 43d51d243631 2 days ago 179 MB
docker-staging-local.artifactory.corp.alleninstitute.org/rabbitmq 201703107 08927efd735e 2 days ago 179 MB
docker-staging-local.artifactory.corp.alleninstitute.org/rabbitmq 201703109 646cb7852d8e 2 days ago 179 MB
</pre>
</td></tr>
</tbody></table>
</div>
<br />
Let's break down the on-liner:<br />
<br />
<br />
<pre style="line-height: 125%; margin: 0;">1) docker images => </pre>
<br />
get the images<br />
<br />
<pre style="line-height: 125%; margin: 0;">2) grep rabbitmq</pre>
<br />
find the tags you care about<br />
<br />
<pre style="line-height: 125%; margin: 0;">3) tr -s <span style="background-color: #fff0f0;">' '</span></pre>
<br />
convert all repeating contiguous spaces to a single space so we can easily index to a specific column, the tag in this case<br />
<br />
<pre style="line-height: 125%; margin: 0;">4) xargs -I <span style="color: #333333;">{}</span> docker rmi docker-staging-local.artifactory.corp.alleninstitute.org/rabbitmq:<span style="color: #333333;">{}</span> </pre>
<pre style="line-height: 125%; margin: 0;"></pre>
pass the thusly recovered tag to the `docker rmi` command, but use xargs to change the stdout of the previous cmd to a cmd line arg (which is what `docker rmi` works with)
Unknownnoreply@blogger.com3tag:blogger.com,1999:blog-11265228.post-13750844066056968952016-08-13T05:52:00.002-07:002016-08-18T05:04:28.256-07:00Java : Don't compare Integer objects with ==<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-uTUV5k8P0yU/SUgnAftQ0NI/AAAAAAAAAC0/GyQOb9OrCT0sooAF2e8HAUNXNBISnQiLACPcB/s1600/cheeserjk7.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="155" src="https://1.bp.blogspot.com/-uTUV5k8P0yU/SUgnAftQ0NI/AAAAAAAAAC0/GyQOb9OrCT0sooAF2e8HAUNXNBISnQiLACPcB/s200/cheeserjk7.png" width="200" /></a></div>
In Java, == operator, if used between objects, compares their references. However, if you compare an object and a primitive type, Java "unboxes" the object to the primitive type and compares the primitives. This auto-conversion is a very pleasant feature of the language, but unfortunately, there is no unboxing if both types being compared to are objects.<br />
<br />
So comparing two Integer objects will not work.<br />
<br />
Comparing the intValue() on the objects is the way to go.<br />
<br />
Sometimes, we may forget that two integer objects are being compared. Think of using a count as the value of a hash map. This is an integer, but due to the way generics are implemented, the value needs to be an Integer object. Comparing counts will need to use the intValue of the Integer object, such as needed for sorting the map by count.<br />
<br />
The other relational operators (>, <, >=, <=) work as one might more intuitively expect. They use the Comparable.compare(T obj) method of the Integer class and work on the integer value held in the Integer instances being compared.<br />
<br />
It is not clear why the Java language designers chose to wire == operator very differently from the other relational operators.<br />
<br />
So the lesson is that whenever you work with Integer instances, all relational operators except the equality operator (==) works in the intuitive sense of comparing the integers. For equality, you could use one of the following:<br />
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #333399; font-weight: bold;">boolean</span> Integer<span style="color: #333333;">.</span><span style="color: #0000cc;">equals</span><span style="color: #333333;">(</span>Object obj<span style="color: #333333;">)</span>
<span style="color: #333399; font-weight: bold;">int</span> Integer<span style="color: #333333;">.</span><span style="color: #0000cc;">compareTo</span><span style="color: #333333;">(</span>Integer another<span style="color: #333333;">)</span>
integerObj1<span style="color: #333333;">.</span><span style="color: #0000cc;">intValue</span><span style="color: #333333;">()</span> <span style="color: #333333;">==</span> integerObj2<span style="color: #333333;">.</span><span style="color: #0000cc;">intValue</span><span style="color: #333333;">()</span>
</pre>
</div>
<br />
<br />Unknownnoreply@blogger.com108tag:blogger.com,1999:blog-11265228.post-27362761213329845772016-07-27T12:58:00.003-07:002016-07-27T12:58:50.663-07:00Optimal SolutionsHere is a <a href="https://www.hackerrank.com/challenges/java-dequeue">Hackerrank exercise</a> that allows us to think of improving program speed in multiple ways.<br />
This is my solution, after a few attempts trying to pass the last two test cases.<br />
<br />
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<table><tbody>
<tr><td><pre style="line-height: 125%; margin: 0;"> 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46</pre>
</td><td><pre style="line-height: 125%; margin: 0;"> <span style="color: #008800; font-weight: bold;">import</span> <span style="color: #0e84b5; font-weight: bold;">java.util.*</span><span style="color: #333333;">;</span>
<span style="color: #008800; font-weight: bold;">public</span> <span style="color: #008800; font-weight: bold;">class</span> <span style="color: #bb0066; font-weight: bold;">Contiguous</span> <span style="color: #333333;">{</span>
<span style="color: #008800; font-weight: bold;">public</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">main</span><span style="color: #333333;">(</span>String<span style="color: #333333;">[]</span> args<span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
Scanner in <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">new</span> Scanner<span style="color: #333333;">(</span>System<span style="color: #333333;">.</span><span style="color: #0000cc;">in</span><span style="color: #333333;">);</span>
Deque<span style="color: #333333;"><</span>Integer<span style="color: #333333;">></span> deque <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">new</span> ArrayDeque<span style="color: #333333;"><</span>Integer<span style="color: #333333;">>();</span>
<span style="color: #333399; font-weight: bold;">int</span> n <span style="color: #333333;">=</span> in<span style="color: #333333;">.</span><span style="color: #0000cc;">nextInt</span><span style="color: #333333;">();</span>
<span style="color: #333399; font-weight: bold;">int</span> m <span style="color: #333333;">=</span> in<span style="color: #333333;">.</span><span style="color: #0000cc;">nextInt</span><span style="color: #333333;">();</span>
<span style="color: #333399; font-weight: bold;">int</span><span style="color: #333333;">[]</span> arr <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">new</span> <span style="color: #333399; font-weight: bold;">int</span><span style="color: #333333;">[</span>n<span style="color: #333333;">];</span>
<span style="color: #333399; font-weight: bold;">int</span> maxUnqs <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span><span style="color: #333333;">;</span>
<span style="color: #333399; font-weight: bold;">int</span> unqs <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span><span style="color: #333333;">;</span>
Map<span style="color: #333333;"><</span>Integer<span style="color: #333333;">,</span> Integer<span style="color: #333333;">></span> map <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">new</span> HashMap<span style="color: #333333;"><</span>Integer<span style="color: #333333;">,</span> Integer<span style="color: #333333;">>();</span>
<span style="color: #008800; font-weight: bold;">for</span> <span style="color: #333333;">(</span><span style="color: #333399; font-weight: bold;">int</span> i <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span><span style="color: #333333;">;</span> i <span style="color: #333333;"><</span> n<span style="color: #333333;">;</span> i<span style="color: #333333;">++)</span> <span style="color: #333333;">{</span>
<span style="color: #333399; font-weight: bold;">int</span> num <span style="color: #333333;">=</span> in<span style="color: #333333;">.</span><span style="color: #0000cc;">nextInt</span><span style="color: #333333;">();</span>
deque<span style="color: #333333;">.</span><span style="color: #0000cc;">add</span><span style="color: #333333;">(</span>num<span style="color: #333333;">);</span> <span style="color: #888888;">//adds to tail</span>
<span style="color: #888888;">//count occurrence of each number</span>
Integer c <span style="color: #333333;">=</span> map<span style="color: #333333;">.</span><span style="color: #0000cc;">get</span><span style="color: #333333;">(</span>num<span style="color: #333333;">);</span>
<span style="color: #008800; font-weight: bold;">if</span> <span style="color: #333333;">(</span>c <span style="color: #333333;">==</span> <span style="color: #008800; font-weight: bold;">null</span> <span style="color: #333333;">||</span> c <span style="color: #333333;">==</span> <span style="color: #0000dd; font-weight: bold;">0</span><span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
map<span style="color: #333333;">.</span><span style="color: #0000cc;">put</span><span style="color: #333333;">(</span>num<span style="color: #333333;">,</span><span style="color: #0000dd; font-weight: bold;">1</span><span style="color: #333333;">);</span>
unqs<span style="color: #333333;">++;</span>
<span style="color: #333333;">}</span> <span style="color: #008800; font-weight: bold;">else</span> <span style="color: #333333;">{</span>
map<span style="color: #333333;">.</span><span style="color: #0000cc;">put</span><span style="color: #333333;">(</span>num<span style="color: #333333;">,</span> c<span style="color: #333333;">+</span><span style="color: #0000dd; font-weight: bold;">1</span><span style="color: #333333;">);</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">if</span> <span style="color: #333333;">(</span>deque<span style="color: #333333;">.</span><span style="color: #0000cc;">size</span><span style="color: #333333;">()</span> <span style="color: #333333;">==</span> m<span style="color: #333333;">)</span> <span style="color: #333333;">{</span> <span style="color: #888888;">//we have added m elements; go compute uniques</span>
<span style="color: #008800; font-weight: bold;">if</span> <span style="color: #333333;">(</span>unqs <span style="color: #333333;">==</span> m<span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
maxUnqs <span style="color: #333333;">=</span> m<span style="color: #333333;">;</span> <span style="color: #888888;">//we can't do better</span>
<span style="color: #008800; font-weight: bold;">break</span><span style="color: #333333;">;</span>
<span style="color: #333333;">}</span>
<span style="color: #008800; font-weight: bold;">if</span> <span style="color: #333333;">(</span>unqs <span style="color: #333333;">></span> maxUnqs<span style="color: #333333;">)</span> <span style="color: #333333;">{</span>
maxUnqs <span style="color: #333333;">=</span> unqs<span style="color: #333333;">;</span>
<span style="color: #333333;">}</span>
<span style="color: #333399; font-weight: bold;">int</span> first <span style="color: #333333;">=</span> deque<span style="color: #333333;">.</span><span style="color: #0000cc;">remove</span><span style="color: #333333;">();</span> <span style="color: #888888;">//we don't need the head any more; keep queue small</span>
<span style="color: #333399; font-weight: bold;">int</span> firstCount <span style="color: #333333;">=</span> map<span style="color: #333333;">.</span><span style="color: #0000cc;">get</span><span style="color: #333333;">(</span>first<span style="color: #333333;">);</span>
<span style="color: #008800; font-weight: bold;">if</span> <span style="color: #333333;">(</span>firstCount <span style="color: #333333;">==</span> <span style="color: #0000dd; font-weight: bold;">1</span><span style="color: #333333;">)</span> <span style="color: #333333;">{</span> <span style="color: #888888;">//the head had the only occurrence of the number, so uniques drop</span>
unqs<span style="color: #333333;">--;</span>
<span style="color: #333333;">}</span>
map<span style="color: #333333;">.</span><span style="color: #0000cc;">put</span><span style="color: #333333;">(</span>first<span style="color: #333333;">,</span> firstCount<span style="color: #333333;">-</span><span style="color: #0000dd; font-weight: bold;">1</span><span style="color: #333333;">);</span> <span style="color: #888888;">//one less occurrence of the first element now</span>
<span style="color: #333333;">}</span>
<span style="color: #333333;">}</span>
System<span style="color: #333333;">.</span><span style="color: #0000cc;">out</span><span style="color: #333333;">.</span><span style="color: #0000cc;">format</span><span style="color: #333333;">(</span><span style="background-color: #fff0f0;">"%d\n"</span><span style="color: #333333;">,</span> maxUnqs<span style="color: #333333;">);</span>
<span style="color: #333333;">}</span>
<span style="color: #333333;">}</span>
</pre>
</td></tr>
</tbody></table>
</div>
<br />
What strikes me here is that a lot of different things are being done under a single loop. The code is therefore hard to read.<br />
<br />
If the code is made to read well, we lose the speed.<br />
<br />
The most readable code will do something like this:<br />
<br />
1) Read the input into a collection (possibly an array)<br />
2) Make sub arrays of size m, starting from position 0, ending at position n-m<br />
3) For each sub array, count the uniques -> we could use a bitmap, a boolean array, or a hashmap for this<br />
4) Keep track of the maximum unique count<br />
<br />
However, this readable algorithm runs in O(nm) time, whereas the solution given above is O(n). The optimal solution keeps track of up to m integers read, and when the m+1 integer is read, it simply removes the effects of the first integer from the system. Thus we can go through each integer only once and solve the problem.<br />
<br />
But this alone was not enough to pass the last two test cases. If we don't update <unqs> variable for each loop iteration, but compute it from the <map> each time we have a queue of m items, the code is still too slow to pass, although the time complexity is the same. While the time complexity is the same, we roughly double the work here and times out.<br />
<br />
This slower solution would, on each m<span style="font-size: xx-small;">th</span> item go over the map, counting keys that mapped to non-zero values. This would take m steps, but since we do this only n/m times, the time complexity would still be O(nm), but it would mean that for each n/m time slot, we do m+m = 2m work, effectively taxing the CPU twice.<br />
<br />
Another solution I saw uses the map, but removes an element when its count drops to zero. Now we can use map.size() to calculate uniques. Since the map size can be read in constant time, this solution works as well to pass the tests.<br />
<br />
All this led me to think about this <a href="http://codekata.com/kata/kata08-conflicting-objectives/">Code Katta</a>. The goals we set out will determine the kind of program we write.<br />
<br />
Having said that, I believe it is possible to make this optimal program slightly more readable. If we encapsulate the state of the window of integers we are dealing with into a SlidingWindow class, we could implement four functions :<br />
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<table><tbody>
<tr><td><pre style="line-height: 125%; margin: 0;">1
2
3
4
5
6</pre>
</td><td><pre style="line-height: 125%; margin: 0;"><span style="color: #008800; font-weight: bold;">class</span> <span style="color: #bb0066; font-weight: bold;">SlidingWindow</span> <span style="color: #333333;">{</span>
<span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">addNumberToWindow</span><span style="color: #333333;">(</span><span style="color: #333399; font-weight: bold;">int</span> num<span style="color: #333333;">)</span>
<span style="color: #333399; font-weight: bold;">boolean</span> <span style="color: #0066bb; font-weight: bold;">isFullWindow</span><span style="color: #333333;">()</span>
<span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">processFullWindow</span><span style="color: #333333;">()</span>
<span style="color: #333399; font-weight: bold;">int</span> <span style="color: #0066bb; font-weight: bold;">maxUnique</span><span style="color: #333333;">()</span>
<span style="color: #333333;">}</span>
</pre>
</td></tr>
</tbody></table>
</div>
<br />
This will at least break up the input processing logic from the state handling. Lines 16 - 24 would get folded into addNumberToWindow; Lines 26 - 40 would be inside processFullWindow.<br />
<br />Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-11265228.post-71363765887535612012015-10-08T09:02:00.000-07:002015-10-08T09:02:42.556-07:00Cartesian product in Scala<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-L3mR4A7tIYk/VeuLsxTrDOI/AAAAAAAAA78/sQM1qrvFRPw/s1600/scala-icon.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="200" src="http://1.bp.blogspot.com/-L3mR4A7tIYk/VeuLsxTrDOI/AAAAAAAAA78/sQM1qrvFRPw/s200/scala-icon.png" width="200" /></a></div>
<span style="font-family: Courier New, Courier, monospace;">val s = Seq("1","2","3","4")</span><br />
<span style="font-family: Courier New, Courier, monospace;">val t = Seq("a","b","c","d","e","f")</span><br />
<span style="font-family: Courier New, Courier, monospace;">val u = Seq("x","y")</span><br />
<span style="font-family: Courier New, Courier, monospace;">val v = Seq("m","l","n","o","p")</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">val sq = Seq(s,t,u,v)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">sq.foldLeft(Seq(""))((b,a) => b.flatMap{i=>a.map{j=>i+j}})</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">res9: Seq[String] = List(1axm, 1axl, 1axn, 1axo, 1axp, 1aym, 1ayl, 1ayn, 1ayo, 1ayp, 1bxm, 1bxl, 1bxn, 1bxo, 1bxp, 1bym, 1byl, 1byn, 1byo, 1byp, 1cxm, 1cxl, 1cxn, 1cxo, 1cxp, 1cym, 1cyl, 1cyn, 1cyo, 1cyp, 1dxm, 1dxl, 1dxn, 1dxo, 1dxp, 1dym, 1dyl, 1dyn, 1dyo, 1dyp, 1exm, 1exl, 1exn, 1exo, 1exp, 1eym, 1eyl, 1eyn, 1eyo, 1eyp, 1fxm, 1fxl, 1fxn, 1fxo, 1fxp, 1fym, 1fyl, 1fyn, 1fyo, 1fyp, 2axm, 2axl, 2axn, 2axo, 2axp, 2aym, 2ayl, 2ayn, 2ayo, 2ayp, 2bxm, 2bxl, 2bxn, 2bxo, 2bxp, 2bym, 2byl, 2byn, 2byo, 2byp, 2cxm, 2cxl, 2cxn, 2cxo, 2cxp, 2cym, 2cyl, 2cyn, 2cyo, 2cyp, 2dxm, 2dxl, 2dxn, 2dxo, 2dxp, 2dym, 2dyl, 2dyn, 2dyo, 2dyp, 2exm, 2exl, 2exn, 2exo, 2exp, 2eym, 2eyl, 2eyn, 2eyo, 2eyp, 2fxm, 2fxl, 2fxn, 2fxo, 2fxp, 2fym, 2fyl, 2fyn, 2fyo, 2fyp, 3axm, 3axl, 3axn, 3axo, 3axp, 3aym, 3ayl, 3ayn, 3ayo...</span>Unknownnoreply@blogger.com4tag:blogger.com,1999:blog-11265228.post-13184495443464715392015-09-05T17:32:00.000-07:002015-09-05T17:41:43.588-07:00Scala : Another example of collectFirst<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-L3mR4A7tIYk/VeuLsxTrDOI/AAAAAAAAA74/O-0iRrfm9js/s1600/scala-icon.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="200" src="http://3.bp.blogspot.com/-L3mR4A7tIYk/VeuLsxTrDOI/AAAAAAAAA74/O-0iRrfm9js/s200/scala-icon.png" width="200" /></a></div>
Say, we want to find the population of the native country of the first European bird in a list of birds. This can be done by flatmapping over the birds, thus ignoring the non-European birds, and then getting the head of the resulting list. But then, we have to go over each bird, although we need to just get the first bird. If the list is sufficiently large, this is very inefficient.<br />
<br />
The other approach is to use find to get the first European bird, and then simply get its native country's population. But here we have to get the native country of the bird twice. In this example, that is cheap, but if this information requires an expensive database retrieval, then we may not want to do it twice.
Where collectFirst becomes very important is when this operation is very expensive - imagine doing a statistical calculation that provides the value for the match.
<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>case class Bird(name: String)
object PlayCollect {
val birds = Seq(Bird("jaybird"), Bird("owl"), Bird("sparrow"))
val nativeCountries = Map( (Bird("jaybird")->"Malaysia"), (Bird("sparrow")->"Ireland"), (Bird("owl")->"Chile") )
val continents = Map( ("Malaysia" -> "Asia"), ("Ireland" -> "Europe"), ("Chile" -> "South America") )
val population = Map( ("Malaysia"->10000000), ("Ireland" -> 12000000), "Chile"->45000000 )
val popEuBird = new PartialFunction[Bird, Int] {
def apply(b: Bird) = {
continents(nativeCountries(b)) match {
case "Europe" => population(nativeCountries(b))
}
}
def isDefinedAt(b: Bird) = {
continents(nativeCountries(b)) match {
case "Europe" => true
case _ => false
}
}
}
def populationOfNativeCountryOfFirstEuropeanBird: Option[Int] = {
birds collectFirst popEuBird
}
def main (args: Array[String]) {
populationOfNativeCountryOfFirstEuropeanBird.foreach{pop => println(pop)}
}
}
</code></pre>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-11265228.post-25112974989571226212015-09-02T23:30:00.000-07:002015-09-05T17:46:11.873-07:00Scala : collectFirst example<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-L3mR4A7tIYk/VeuLsxTrDOI/AAAAAAAAA78/sQM1qrvFRPw/s1600/scala-icon.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="200" src="http://1.bp.blogspot.com/-L3mR4A7tIYk/VeuLsxTrDOI/AAAAAAAAA78/sQM1qrvFRPw/s200/scala-icon.png" width="200" /></a></div>
On occasion, you want to find the first occurrence of an item in a list, then transform it to another type. Using partial functions with collectFirst, we can accomplish this.
<br />
<br />
collectFirst will first call isDefinedAt to determine if apply should be called for the current item in the list. Thus it will skip over the non matching elements in the list without incurring a match exception.<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>trait Animal
case class Mammal(name: String) extends Animal
case class Bird(name: String) extends Animal
val animals = Seq(Mammal("elephant"), Mammal("tiger"), Bird("raven"), Mammal("monkey"), Bird("peacock"), Bird("sparrow"))
val matchBird = new PartialFunction[Animal, Bird] {
def apply(p: Animal) = {
p match {
case b@Bird(name) => b
}
}
def isDefinedAt(p: Animal) = {
p match {
case Bird(name) => true
case _ => false
}
}
}
animals collectFirst matchBird
res11: Option[Bird] = Some(Bird(raven))
</code></pre>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-11265228.post-42884995106166824582015-02-27T08:21:00.003-08:002015-02-27T08:21:51.761-08:00Titan : using native hadoop libraries on MacOSX<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-5BmcVNOV6Fs/VPCY211UPGI/AAAAAAAAA5k/Kc2hcmNzgVw/s1600/titan-logo.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-5BmcVNOV6Fs/VPCY211UPGI/AAAAAAAAA5k/Kc2hcmNzgVw/s320/titan-logo.png" /></a></div>
Once you have <a href="http://gauravkohli.com/2014/09/28/building-native-hadoop-v-2-4-1-libraries-for-os-x/">built the native hadoop libraries</a> on your MacOSX, you need to add this bit of code to bin/gremlin.sh so that it can find them:
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>if [ -e "${HADOOP_PREFIX}/lib/native/libhadoop.dylib" ]; then
LIB_PATH="${HADOOP_PREFIX}/lib/native"
if [ -n "${LD_LIBRARY_PATH:-}" ]; then
LD_LIBRARY_PATH="${LIB_PATH}:${LD_LIBRARY_PATH}" # For Linux
else
LD_LIBRARY_PATH="${LIB_PATH}"
fi
if [ -n "${DYLD_LIBRARY_PATH:-}" ]; then
DYLD_LIBRARY_PATH="${LIB_PATH}:${DYLD_LIBRARY_PATH}" # For Mac
else
DYLD_LIBRARY_PATH="${LIB_PATH}"
fi
fi
</code></pre>
<br />
The only oddity here is that the script uses "set -u" at the top, which makes bash complain if you use uninitialized variables. So you have to append ":-" to the variables that you are testing. You can see that in the lines that test LD_LIBRARY_PATH etc.
Unknownnoreply@blogger.com0