tag:blogger.com,1999:blog-30437092024-03-21T06:18:51.504-07:00Shantanu's Blog<b> Database Consultant </b>shantanuhttp://www.blogger.com/profile/04386423685935921709noreply@blogger.comBlogger1152125tag:blogger.com,1999:blog-3043709.post-57678098530559173352023-11-18T00:08:00.000-08:002023-11-18T00:08:07.428-08:00awk Case Study - 10<p>Formats its input into lines that are at most 60 characters long</p># fmt - format<br /># input: text<br /># output: text formatted into lines of &lt;= 60 characters<br /><br />awk '/./ { for (i = 1; i &lt;= NF; i++) addword($i) }<br />/^$/ { printline(); print "" }<br />END { printline() }<br /><br />function addword(w) {<br />&nbsp; &nbsp; if (length(line) + length(w) &gt; 60)<br />&nbsp; &nbsp; &nbsp; &nbsp; printline()<br />&nbsp; &nbsp; &nbsp; &nbsp; line = line " " w<br />&nbsp; &nbsp; }<br />function printline() {<br />&nbsp; &nbsp; if (length(line) &gt; 0) {<br />&nbsp; &nbsp; &nbsp; &nbsp; print substr(line, 2)<br />&nbsp; &nbsp; &nbsp; &nbsp; line = ""<br />&nbsp; &nbsp; }<br />}' long.txt&nbsp;&nbsp;shantanuhttp://www.blogger.com/profile/04386423685935921709noreply@blogger.com0tag:blogger.com,1999:blog-3043709.post-14719129876486732352023-11-17T23:54:00.000-08:002023-11-17T23:54:58.239-08:00awk Case Study - 9<p><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">Cliche generator, which creates new cliches out of old ones. The input is a set of sentences like</span></p><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;"><div><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;"># cat&nbsp;</span>cliche.txt</div></span><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">A rolling stone:gathers no moss.</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">History:repeats itself.</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">He who lives by the sword:shall die by the sword.</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">A jack of all trades:is master of none.</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">Nature:abhors a vacuum.</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">Every man:has a price.</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">All's well that:ends well.</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">where a colon separates subject from predicate. Our cliche program combines a random subject with a random predicate; with luck it produces the occasional mildly amusing aphorism:</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">A rolling stone repeats itself.</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">History abhors a vacuum.</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">Nature repeats itself.</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">All's well that gathers no moss.</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">He who lives by the sword has a price.</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;"># cliche - generate an endless stream of cliches</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;"># input: lines of form subject:predicate</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;"># output: lines of random subject and random predicate</span><div><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;"><br /></span></div><div><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;"><br /></span></div><div><span style="background-color: white;"><span style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: x-small;"><div>awk 'BEGIN { FS = ":" }</div><div>{ x[NR] = $1; y[NR] = $2 }</div><div>END { for (;;) print x[randint(NR)], y[randint(NR)] }</div><div>function randint(n) {</div><div>&nbsp; &nbsp; return int(n *rand()) + 1&nbsp;</div><div>}' cliche.txt</div><div><br /></div><div>Don't forget that this program is intentionally an infinite loop.&nbsp;</div></span></span></div>shantanuhttp://www.blogger.com/profile/04386423685935921709noreply@blogger.com0tag:blogger.com,1999:blog-3043709.post-36201764514138569192023-11-17T23:39:00.000-08:002023-11-17T23:41:57.919-08:00 awk Case Study - 8<p>Isolate the words and aggregate the count for each word in an associative array. A word is a field without the punctuation marks like ? or ,</p><div># wordfreq - print number of occurrences of each word&nbsp;</div><div># input: text&nbsp;</div><div># output: number-word pairs sorted by number&nbsp;<br /></div><div><br /></div><div>awk '{<br />&nbsp; &nbsp; gsub (/I [ ., : ; I ? ( ) { } ] /, &nbsp;"" )<br />&nbsp; &nbsp; for (i = 1; i &lt;= NF; i++)<br />&nbsp; &nbsp; &nbsp; &nbsp; count[$i]++<br />&nbsp; &nbsp; }<br />END {<br />&nbsp; &nbsp; for (w in count)<br />&nbsp; &nbsp; &nbsp; &nbsp; print count[w], w | "sort -rn"<br />&nbsp; &nbsp; }' capitals</div>shantanuhttp://www.blogger.com/profile/04386423685935921709noreply@blogger.com0tag:blogger.com,1999:blog-3043709.post-29424107208781965242023-11-17T23:29:00.000-08:002023-11-17T23:42:19.210-08:00awk Case Study - 7<div>Print the names of the countries in Asia along with their populations and capitals:</div><div><br /></div><div># cat capitals<br /></div><div>USSR Moscow<br />Canada Ottawa<br />China Beijing<br />USA Washington<br />Brazil Brasilia<br />India New Delhi<br />Mexico Mexico<br />France Paris<br />Japan Tokyo<br />Germany Bonn<br />England London<br /><br /></div><div># cat countries<br />USSR 8649 275 Asia<br />Canada 3852 25 North America<br />China 3705 1032 Asia<br />USA 3615 237 North America<br />Brazil 3286 134 South America<br />India 1267 746 Asia<br />Mexico 762 78 North America<br />France 211 55 Europe<br />Japan 144 120 Asia<br />Germany 96 61 Europe<br />England 94 56 Europe<br /></div><div><br /></div><div><div>(make sure that the files are tab separated)</div><div><br /></div></div><div><div># awk 'BEGIN { FS = "\t"}<br />FILENAME == "capitals" {<br />&nbsp; &nbsp; cap[$1] = $2<br />}<br />FILENAME == "countries" &amp;&amp; $4 == "Asia" {<br />&nbsp; &nbsp; print $1, $3, cap[$1]<br />}' capitals countries</div><div><br /></div><div><br />USSR 275 Moscow<br />China 1032 Beijing<br />India 746 New Delhi<br />Japan 120 Tokyo<br /></div></div><div><br /></div><div>It would certainly be easier if we could just say something like</div><div><br /></div><div>continent ~ /Asia/ { print $country, $population, $capital}</div><p></p><div></div><p></p><div>and have a program figure out where the fields are and how to put them together. This is how we would phrase this query in qawk</div><div><br /></div>shantanuhttp://www.blogger.com/profile/04386423685935921709noreply@blogger.com0tag:blogger.com,1999:blog-3043709.post-30771093288561484292023-11-17T23:04:00.000-08:002023-11-17T23:41:20.910-08:00awk Case Study - 6<p>The following program arith presents a sequence of addition problems like</p><p>7 + 9 = ?</p><p>After each problem, the user types an answer. If the answer is right, the user is praised and presented with another problem. If the answer is wrong, the program asks for the answer again. If the user provides no answer at all, the right answer is printed before the next problem is presented. The program is invoked with one of two command lines:</p><p>awk -f arith<br />awk -f arith n</p><p>If there is an argument after ari th on the command line, the argument is used to limit the maximum size of the numbers in each problem. After this argument has been read, ARGV [ 1 ] is reset to "-" so the program will be able to read the answers from the standard input. If no argument is specified, the maximum size will be 10.<br /><br />BEGIN {<br />&nbsp; &nbsp; maxnum = ARGC &gt; 1 ? ARGV[1] : 10<br />&nbsp; &nbsp; ARGV[1] = "-"<br />&nbsp; &nbsp; srand()<br />&nbsp; &nbsp; do {<br />&nbsp; &nbsp; &nbsp; &nbsp; n1 = randint(maxnum)<br />&nbsp; &nbsp; &nbsp; &nbsp; n2 = randint(maxnum)<br />&nbsp; &nbsp; &nbsp; &nbsp; printf("%g + %g = ? ", n1, n2)<br />&nbsp; &nbsp; &nbsp; &nbsp; while ((input = getline) &gt; 0)<br />&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; if ($0 == n1 + n2) {<br />&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; print "Right!"<br />&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; break<br />&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; } else if ($0 == "") {<br />&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; print n1 + n2<br />&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; break<br />&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; } else<br />&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; printf("wrong, try again: ")<br />&nbsp; &nbsp; } while (input &gt; 0)<br />}<br />function randint(n) { <br />return int(rand()*n)+1<br />&nbsp;}</p>shantanuhttp://www.blogger.com/profile/04386423685935921709noreply@blogger.com0tag:blogger.com,1999:blog-3043709.post-71475736848666385702023-11-17T21:44:00.000-08:002023-11-17T21:44:06.466-08:00 awk Case Study - 5<p>Mail-merge using AWK</p><div><br /></div><div># cat letter.text<br />Subject: Demographic Information About #1 <br />From: AWK Demographics, Inc. <br />In response to your request for information about #1, our latest researchas revealed that its population is #2 million people and its area is #3 million square miles. This gives #1 a population density of #4 people per square mile. <br /><br /></div><div><br /></div><div># echo "Canada:25:3.852:6.5" | awk 'BEGIN {<br />&nbsp; &nbsp; FS = ":"<br />&nbsp; &nbsp; while (getline &lt; "letter.text" &gt; 0)<br />&nbsp; &nbsp; &nbsp; &nbsp; form[++n] = $0<br />&nbsp; &nbsp; } {<br />&nbsp; &nbsp; for (i = 1; i &lt;= n; i++) {<br />&nbsp; &nbsp; &nbsp; &nbsp; temp = form[i]<br />&nbsp; &nbsp; &nbsp; &nbsp; for (j = 1; j &lt;= NF; j++)<br />&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; gsub("#" j, $j, temp)<br />&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; print temp<br />&nbsp; &nbsp; &nbsp; &nbsp; }<br />}'<br /></div><div></div>shantanuhttp://www.blogger.com/profile/04386423685935921709noreply@blogger.com0tag:blogger.com,1999:blog-3043709.post-4213707690436446162023-10-30T04:17:00.006-07:002023-11-17T21:44:35.927-08:00awk Case Study - 4<p><span face="Arial, Helvetica, sans-serif" style="background-color: white; color: #222222; font-size: small;"># save the awk command as shell script and call it<br /></span><span face="Arial, Helvetica, sans-serif" style="background-color: white; color: #222222; font-size: small;"># using the first variable</span></p><span face="Arial, Helvetica, sans-serif" style="background-color: white; color: #222222; font-size: small;"># vi info</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span face="Arial, Helvetica, sans-serif" style="background-color: white; color: #222222; font-size: small;">awk '</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span face="Arial, Helvetica, sans-serif" style="background-color: white; color: #222222; font-size: small;">BEGIN { FS = "\t" }</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span face="Arial, Helvetica, sans-serif" style="background-color: white; color: #222222; font-size: small;">$1 ~ /'$1'/ {</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span face="Arial, Helvetica, sans-serif" style="background-color: white; color: #222222; font-size: small;">printf("%s:\n", $1)</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span face="Arial, Helvetica, sans-serif" style="background-color: white; color: #222222; font-size: small;">printf("\t%d million people\n", $3)</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span face="Arial, Helvetica, sans-serif" style="background-color: white; color: #222222; font-size: small;">printf("\t%.3f million sq. mi.\n", $2/1000)</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span face="Arial, Helvetica, sans-serif" style="background-color: white; color: #222222; font-size: small;">printf("\t%.1f people per sq. mi.\n", 1000*$3/$2)</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span face="Arial, Helvetica, sans-serif" style="background-color: white; color: #222222; font-size: small;">}' countries.txt</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span face="Arial, Helvetica, sans-serif" style="background-color: white; color: #222222; font-size: small;"># sh info Canada</span><div><span face="Arial, Helvetica, sans-serif" style="background-color: white; color: #222222; font-size: small;"><div># sh info 'Canada|USA'</div><div><br /></div></span></div>shantanuhttp://www.blogger.com/profile/04386423685935921709noreply@blogger.com0tag:blogger.com,1999:blog-3043709.post-39674697524058378152023-10-30T01:55:00.001-07:002023-11-17T21:45:28.726-08:00awk Case Study - 3<p>&nbsp;<span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;"># percent</span></p><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;"># input: a column of nonnegative numbers</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;"># output: each number and its percentage of the total</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">awk '{ x[NR] = $1; sum += $1 }</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">END { if (sum != 0)</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">for (i = 1; i &lt;= NR; i++)</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">printf("%10.2f %5.1f\n", x[i], 100*x[i]/sum)</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">}' test.txt</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;"># histogram</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;"># input: numbers between 0 and 100</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;"># output: histogram of deciles</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">vi histogram.txt</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">{ x[int($1/10)]++ }</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">END { for (i = 0; i &lt; 10; i++)</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">printf("%2d- %2d: %3d %s\n",</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">10*i, 10*i+9, x[i], rep(x[i],"*"))</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">printf("100: %3d %s\n", x[10], rep(x[10],"*"))</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">}</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">function rep(n,s, t) { # return string of n s's</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">while (n-- &gt; 0)</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">t = t s</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">return t</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">}</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">awk '</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;"># generate random integers</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">BEGIN { for (i = 1; i &lt;= 200; i++)</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">print int(101*rand())</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">}' | awk -f histogram.txt</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;"># remove comma and then sum</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">awk '{gsub(/,/, ""); sum+= $0 } END { print sum }' test.txt</span><div><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;"><br /></span></div>shantanuhttp://www.blogger.com/profile/04386423685935921709noreply@blogger.com0tag:blogger.com,1999:blog-3043709.post-74613982831704073242023-10-30T01:02:00.001-07:002023-11-17T21:45:46.938-08:00awk Case Study - 2<p><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">echo '1,2,3,4' &gt; test.txt</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">echo '5,6,7,8' &gt;&gt; test.txt</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">echo '9,10,11,12' &gt;&gt; test.txt</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">awk -F ',' '{for (i = 1; i &lt;= NF; i++)</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">sum[i] += $i</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">if (NF &gt; maxfld)</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">maxfld = NF</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">}</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">END { for (i = 1; i &lt;= maxfld; i++) {</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">printf("%g", sum[i])</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">if (i &lt; maxfld)</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">printf("\t")</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">else</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">printf("\n")</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">}}' test.txt</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;"># 15 &nbsp; &nbsp; &nbsp;18 &nbsp; &nbsp; &nbsp;21 &nbsp; &nbsp; &nbsp;24</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;"># missing entries are treated as zeros</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">awk -F ',' 'NR==1 { nfld = NF }</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">{ for (i = 1; i &lt;= NF; i++)</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">sum[i] += $i</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">if (NF != nfld)</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">print " line " NR " has " NF " entries, not " nfld</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">}</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">END { for (i = 1; i &lt;= nfld; i++)</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">printf("%g%s", sum[i], i &lt; nfld ? "\t" : "\n")</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">}' test.txt</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">&nbsp;</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;"># check that each line has the same number of fields</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;"># as line one</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">awk 'NR == 1 {nfld = NF</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">for (i = 1; i &lt;= NF; i++)</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">numcol[i] = isnum($i)</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">}</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">{for (i = 1; i &lt;= NF; i++)</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">if (numcol[i])</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">sum[i] += $i</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">}</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">END { for (i = 1; i &lt;= nfld; i++) {</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">if (numcol[i])</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">printf("%g", sum[i])</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">else</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">printf("--")</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">printf(i &lt; nfld ? "\t" : "\n")</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">}}</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;">function isnum(n) { return n ~ /^[+-]?[0-9]+$/ }' countries.txt</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;"># sum3 - print sums of numeric columns</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;"># input: rows of integers and strings</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;"># output: sums of numeric columns</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /><span style="background-color: white; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;"># assumes every line has same layout</span><br style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: small;" /></p>shantanuhttp://www.blogger.com/profile/04386423685935921709noreply@blogger.com0tag:blogger.com,1999:blog-3043709.post-69230855841700287222023-10-27T21:16:00.015-07:002023-11-17T21:46:08.351-08:00awk Case Study - 1<p>&nbsp;# cat countries.txt</p><p>USSR&nbsp; &nbsp; 8649&nbsp; &nbsp; 275&nbsp; &nbsp; &nbsp;Asia<br />Canada&nbsp; 3852&nbsp; &nbsp; 25&nbsp; &nbsp; &nbsp; North America<br />China&nbsp; &nbsp;3705&nbsp; &nbsp; 1032&nbsp; &nbsp; Asia<br />USA&nbsp; &nbsp; &nbsp;3615&nbsp; &nbsp; 237&nbsp; &nbsp; &nbsp;North America<br />Brazil&nbsp; 3286&nbsp; &nbsp; 134&nbsp; &nbsp; &nbsp;South America<br />India&nbsp; &nbsp;1267&nbsp; &nbsp; 746&nbsp; &nbsp; &nbsp;Asia<br />Mexico&nbsp; 762&nbsp; &nbsp; &nbsp;78&nbsp; &nbsp; &nbsp; North America<br />France&nbsp; 211&nbsp; &nbsp; &nbsp;55&nbsp; &nbsp; &nbsp; Europe<br />Japan&nbsp; &nbsp;144&nbsp; &nbsp; &nbsp;120&nbsp; &nbsp; &nbsp;Asia<br />Germany 96&nbsp; &nbsp; &nbsp; 61&nbsp; &nbsp; &nbsp; Europe<br />England 94&nbsp; &nbsp; &nbsp; 56&nbsp; &nbsp; &nbsp; Europe</p><p><br /></p><p>awk 'BEGIN { FS = "\t" # make tab the field separator<br />&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;printf("%10s %6s %5s %s\n\n",<br />&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; "COUNTRY", "AREA", "POP", "CONTINENT") }&nbsp;</p><p>{printf("%10s %6d %5d %s\n", $1, $2, $3, $4)<br />&nbsp; &nbsp; area = area + $2<br />&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;pop = pop + $3}</p><p>END { printf("\n%10s %6d %5d\n", "TOTAL", area, pop) } ' countries.txt</p><div><br /></div><div><div># check if country name starts ith Japa</div><div>awk '$1 ~ /Japa/' countries.txt</div><div><br /></div><div><div># &amp; is special. It echoes the original string change to aba</div><div>awk '{ gsub(/a/, "&amp;b&amp;"); print }' countries.txt</div></div><div><br /></div><div><div># Change country names</div></div><div>awk '{ gsub(/North America/, "NA"); print }' countries.txt</div><div><br /></div><div># Change country names</div><div>awk 'BEGIN {FS = OFS = "\t"}</div><div>$4 == "North America" {$4 = "NA" }</div><div>$4 == "South America" {$4 = "SA" }</div><div>{print }' countries.txt</div><div><br /></div><div># add a column for computation</div><div>awk 'BEGIN { FS = OFS = "\t" }</div><div>{ $5 = 1000 * $3 / $2; print }' countries.txt</div><div><br /></div><div># conditional expression&nbsp;</div><div>awk '{ print ($2 &lt; 100 ? "less than 100" : "$2 greater than 100 " NR) }' countries.txt</div><div><br /></div><div># Assignment Operators</div><div><br /></div><div>awk '$4 == "Asia" { pop = pop + $3; n = n + 1 }</div><div>END { print "Total population of the", n,</div><div>"Asian countries is", pop, "million."}' countries.txt</div><div><br /></div><div>awk '$3 &gt; maxpop { maxpop = $3; country = $1 }</div><div>END {print "country with largest population:",</div><div>country, maxpop }' countries.txt</div><div><br /></div><div># second field is a string of digits</div><div>awk 'BEGIN { digits = "^[0-9]+$" } $2 ~ digits' countries.txt</div><div><br /></div><div><br /></div><div># regular expression built up from components</div><div><br /></div><div>awk 'BEGIN {</div><div>sign = "[+-]?"</div><div>decimal= "[0-9]+[.]?[0-9]*"</div><div>fraction= "[.][0-9]+"</div><div>exponent= "([eE]" sign "[0-9]+)?"</div><div>number= "^" sign "(" decimal "!" fraction ")" exponent "$"</div><div>}</div><div>$1 ~ number' countries.txt</div><div><br /></div><div><br /></div></div><div><div># Setting $1 forces awk to recompute $0&nbsp;</div><div># the fields are now separated by a blank&nbsp;</div><div># (the default value of OFS), no longer by a tab</div><div>awk '{ $1 = substr($1, 1, 3); print $0 }' countries.txt</div></div><div><br /></div><div><div># print first 3 chars of first column in a line</div><div>awk '{ s = s substr ( $1 , 1 , 3 ) " " } END { print substr(s, 1, length(s)-1) }' countries.txt</div></div><div><br /></div><div><div># group by sum</div><div>awk '/Asia/ { pop["Asia"] += $3 }</div><div>/Europe/ { pop["Europe"] += $3}</div><div>END {</div><div>print "Asian population is", pop["Asia"], "million."</div><div>print "European population is", pop["Europe"], "million."}' countries.txt</div></div><div><br /></div><div><div># group by sum SQL like query</div><div>awk 'BEGIN { FS = "\t" }&nbsp; {pop[$4] += $3 } END { for (name in pop) print name, pop[name] }' countries.txt</div></div><div><br /></div><div><br /></div><div><div># reverse order using array</div><div>awk '{ x[NR] = $0 }</div><div>END { for (i = NR; i &gt; 0; i--) print x[i] }' countries.txt</div></div><div><br /></div><div><div># return all fields</div><div><br /></div><div>awk '{i = 1; while (i &lt;= NF) { print $i i++ }}' countries.txt</div><div><br /></div><div>awk '{for (i = 1; i &lt;= NF; i++) print $i }' countries.txt</div></div><div><br /></div><div><div># User Defined Function</div><div>awk '{print max($1, max(S2,S3))} #print maximum of $1, $2, $3</div><div>function max(m, n) {</div><div>return m &gt; n ? m : n</div><div>}' countries.txt</div></div><div><br /></div><div># Output Separators</div><div><div>awk 'BEGIN { OFS = ":"; ORS = "\n\n" } { print $1, $2 }' countries.txt</div></div><div><br /></div><div><div># redirect to file</div><div>awk '$3 &gt; 100 { print $1, $3 &gt;"bigpop.txt" }' countries.txt</div></div><div><br /></div><div>awk '{ print($1, $3) &gt; ($3 &gt; 100 ? "big.txt" : "small.txt") }' countries.txt</div><div><br /></div><div><br /></div><div><div># Field Separator</div><div><br /></div><div>awk 'BEGIN {FS = ",[ \t]*![ \t]+"} {print}' countries.txt</div><div><br /></div><div>awk -F ',[ \t]*![ \t]+' '{print}' countries.txt</div></div><div>_____</div><div><br /></div><div><div>awk 'BEGIN { FS = "\t" }</div><div>{ printf("%s:%s:%d:%d:%.1f\n", $4, $1, $3, $2, 1000*$3/$2) | "sort -t: +0 -1 +4rn"}' countries.txt</div><div><br /></div><div><br /></div><div>awk 'BEGIN { FS = "\t" }</div><div>{ printf("%s:%s:%d:%d:%.1f\n", $4, $1, $3, $2, 1000*$3/$2) | "sort -t: +0 -1 +4rn"}' countries.txt | awk 'BEGIN { FS = ":"</div><div>printf("%-15s %-10s %10s %7s %12s\n",</div><div>"CONTINENT", "COUNTRY", "POPULATION", "AREA", "POP. DEN.")}</div><div>{printf("%-15s %-10s %7d %10d %10.1£\n", $1, $2, $3, $4, $5)</div><div>}'</div><div><br /></div><div>awk 'BEGIN { FS = "\t" }</div><div>{ printf("%s:%s:%d:%d:%.1f\n", $4, $1, $3, $2, 1000*$3/$2) | "sort -t: +0 -1 +4rn"}' countries.txt | awk 'BEGIN { FS = ":"</div><div>printf("%-15s %-10s %10s %7s %12s\n",</div><div>"CONTINENT", "COUNTRY", "POPULATION", "AREA", "POP. DEN.")}</div><div>{if ( $1 != prev) {</div><div>print ""</div><div>prev = $1</div><div>} else</div><div>$1 = ""</div><div>printf("%-15s %-10s %7d %10d %10.1f\n", $1, $2, $3, $4, $5)}'</div><div><br /></div><div>awk 'BEGIN { FS = "\t"}</div><div>{ den = 1000*$3/$2; printf("%-15s:%12.8f:%s:%d:%d:%.1f\n",</div><div>$4, 1/den, $1, $3, $2, den) | "sort"}'&nbsp; countries.txt</div><div>_____</div><div><br /></div><div>vi prep3.txt</div><div><br /></div><div>BEGIN { FS = "\t"}</div><div>pass == 1 {</div><div>area[$4] += $2</div><div>areatot += $2</div><div>pop[$4] += $3</div><div>poptot += $3</div><div>}</div><div>pass == 2 {</div><div>den = 1000*$3/$2</div><div>printf("%s:%s:%s:%f:%d:%f:%f:%d:%d\n",</div><div>$4, $1, $3, 100*$3/poptot, $2, 100*$2/areatot,</div><div>den, pop[$4], area[$4]) | "sort -t: +0 -1 +6rn"</div><div>}</div><div><br /></div><div>awk -f prep3.txt pass=1 countries.txt pass=2 countries.txt</div><div><br /></div></div>shantanuhttp://www.blogger.com/profile/04386423685935921709noreply@blogger.com0tag:blogger.com,1999:blog-3043709.post-26093085189838142782023-09-27T23:43:00.002-07:002023-09-27T23:43:35.791-07:00 Enable S3 versioning<p>It is highly recommended to enable versioning for all your S3 buckets. In case you delete a file by mistake, you can find the version ID of the older version and donload that copy.</p><p>1) Find the version ID:<br />aws s3api list-object-versions --bucket cdk-hnb&nbsp; --prefix tt/archive.tar</p><p>2) Donload that version:<br />aws s3api get-object --bucket cdk-hnb --key tt/archive.tar archive.tar --version-id mqfya</p><p>3) List all versions:<br />aws s3api list-object-versions --bucket cdk-hnb</p><p>If versioning is enabled, you can not remove the bucket easily. You first need to "empty" the bucket and then delete it.</p>shantanuhttp://www.blogger.com/profile/04386423685935921709noreply@blogger.com0tag:blogger.com,1999:blog-3043709.post-83317503736190291742023-07-25T03:07:00.000-07:002023-07-25T03:07:30.886-07:00 tar small files in a single object of S3<p>There are times when you need to take backup of a folder in a new bucket before deleting the folder. Let's assume I have hundreds of files in /test/ folder. I can tar all the files in a single file and save it to a different location. I use "CloudShell" utility that is part of aws console.</p><p>I will download the excellent utility called s3tar from...</p><p>wget https://github.com/awslabs/amazon-s3-tar-tool/releases/download/v1.0.10/s3tar-linux-amd64.zip</p><p>unzip s3tar-linux-amd64.zip<br />cd s3tar-linux-amd64</p><p>Now it is easy to collect all files in a single object.</p><p>./s3tar-linux-amd64 --region us-east-1 -cvf s3://cdk-hnb659/archive.tar&nbsp; s3://cf-templates-9f/test/</p><p>Please note:&nbsp;</p><p></p><ul style="text-align: left;"><li>It does not compress the file. It simply collects all the objects.</li><li>It would cost around $0.1 to create an archive of 10,000 files.</li><li>The cumulative size of the TAR must be over 5MB</li><li>The final size cannot be larger than 5TB</li></ul><p></p><p>https://github.com/awslabs/amazon-s3-tar-tool</p>shantanuhttp://www.blogger.com/profile/04386423685935921709noreply@blogger.com0tag:blogger.com,1999:blog-3043709.post-7198066651539525152023-07-21T23:55:00.008-07:002023-09-25T00:22:27.434-07:00 langchain for pandas<p>langchain is a module to query pandas dataframe using Natural Language. It uses chatGPT to build pandas commands!</p><p>!pip install langchain<br />import os<br />os.environ["OPENAI_API_KEY"] = "XXXX"</p><p>from langchain.agents import create_pandas_dataframe_agent<br />from langchain.llms import OpenAI<br />import pandas as pd</p><p>pd_agent = create_pandas_dataframe_agent(OpenAI(temperature=0), df, verbose=True)</p><p>pd_agent.run("Find the total sales for each product line in the year 2003")</p><p>_____</p><p>Something similar...</p><p><br /></p><p># https://github.com/gventuri/pandas-ai</p><p>!pip install pandasai<br />from pandasai import SmartDataframe, SmartDatalake<br />from pandasai.llm import OpenAI<br />llm = OpenAI(api_token="YOUR TOKEN")</p><p>sdf = SmartDataframe(df, config={"llm": llm})<br />sdf.chat("Return the top 5 countries by GDP")<br />sdf.chat("Plot a chart of the gdp by country")</p><p>print(sdf.last_code_generated)</p><p>If you have more than one dataframe, then use SmartDatalake method and supply a list of dataframes. For e.g.</p><p>sdf = SmartDatalake([df, df2, df3], config={"llm": llm})</p>shantanuhttp://www.blogger.com/profile/04386423685935921709noreply@blogger.com0tag:blogger.com,1999:blog-3043709.post-38840781056780782752023-04-30T20:56:00.003-07:002023-04-30T20:56:37.100-07:00 Copy any archive.org pdf file to wikisource<p>Let's assume we need this pdf in text format.</p><p>https://archive.org/details/SukavihRdayAnandinI</p><p>You need to visit the page here...</p><p>https://tinyurl.com/3s567p34</p><p>(You may have to log-in. use your wikipedia username)</p><p>Type the name "SukavihRdayAnandinI" in both the input boxes. Submit the file and it will be available on commons in the next few minutes.</p><p>_____</p><p>Now visit the sanskrit wikisource page by adding the .djvu name to title=अनुक्रमणिका: part. This will create your book in wikisource.</p><p>https://sa.wikisource.org/w/index.php?title=अनुक्रमणिका:SukavihRdayAnandinI.djvu</p><p>Easy. Right?</p>shantanuhttp://www.blogger.com/profile/04386423685935921709noreply@blogger.com0tag:blogger.com,1999:blog-3043709.post-85993283959296739672023-03-31T23:21:00.002-07:002023-03-31T23:21:20.080-07:00 summarize text using chatGPT<p>This function takes any text and summarizes it.</p><p>from marvin import ai_fn</p><p>@ai_fn<br />def summarize(text: str) -&gt; str:</p><p>&nbsp; &nbsp; """Summarize the provided text"""</p><p>import wikipedia<br />page = wikipedia.page('large language model')<br />summarize(text=page.content)</p>shantanuhttp://www.blogger.com/profile/04386423685935921709noreply@blogger.com0tag:blogger.com,1999:blog-3043709.post-81554458566535640922023-03-31T23:17:00.003-07:002023-03-31T23:17:43.448-07:00Using chatGPT to generate fake records for pandas<p>I can use chatGPT to generate fake data for testing in pandas dataframes in python!</p><p>import pandas as pd<br />from marvin import ai_fn</p><p>@ai_fn<br />def fake_people(n: int) -&gt; list[dict]:</p><p>&nbsp; &nbsp; """<br />&nbsp; &nbsp; Generates n examples of fake data representing people,&nbsp; each with a name and an age.</p><p>&nbsp; &nbsp; """</p><p>myfake=fake_people(3)<br />df = pd.DataFrame(myfake)<br />print (df)</p><p># will return the dataframe something like this:</p><p>&nbsp; &nbsp; name&nbsp; age<br />0&nbsp; &nbsp;John&nbsp; &nbsp;28<br />1&nbsp; Emily&nbsp; &nbsp;35<br />2&nbsp; David&nbsp; &nbsp;19</p><p># docker run --rm -it --entrypoint /bin/bash python:3.10<br /># pip install marvin pandas</p>shantanuhttp://www.blogger.com/profile/04386423685935921709noreply@blogger.com0tag:blogger.com,1999:blog-3043709.post-81569153924629253102023-03-15T01:54:00.003-07:002023-03-15T02:02:45.349-07:00 Text embedding model - part II<p>I compared the text similarity algorithms for the folloing strings:</p><p>string_a = 'MUKESH VITHAL GURAV'<br />string_b = 'MUKESH VITTHAL GURAO'</p><p>I found that all the popular methods like cosine, levenshtein, jaro do not work. The best performance was by "entropy_ncd" that is built using word embeddings. The openai model returned the similar (98.10%) score.</p><p>import textdistance<br />mydic = dict()<br />for i in myalgs:<br />&nbsp; &nbsp; try:</p><p>&nbsp; &nbsp; &nbsp; &nbsp; exec('x = textdistance.algorithms.'+i+'.normalized_similarity("'+string_a+'", "'+string_b+'")')<br />&nbsp; &nbsp; &nbsp; &nbsp; mydic[i] = round(x,4)&nbsp;</p><p>&nbsp; &nbsp; except:<br />&nbsp; &nbsp; &nbsp; &nbsp; pass</p><p>dict(sorted(mydic.items(), key=lambda item: item[1]))</p><p>&nbsp;'levenshtein': 0.9,<br />&nbsp;'ratcliff_obershelp': 0.9231,</p><p>&nbsp;'sorensen': 0.9231,<br />&nbsp;'sorensen_dice': 0.9231,<br />&nbsp;'cosine': 0.9234,<br />&nbsp;'needleman_wunsch': 0.925,<br />&nbsp;'gotoh': 0.9474,<br />&nbsp;'overlap': 0.9474,<br />&nbsp;'jaro': 0.9491,<br />&nbsp;'editex': 0.95,<br />&nbsp;'length': 0.95,<br />&nbsp;'jaro_winkler': 0.9695,<br />&nbsp;'strcmp95': 0.9695,<br />&nbsp;'entropy_ncd': 0.9815</p><p>import openai<br />from openai.embeddings_utils import get_embedding, cosine_similarity<br />openai.api_key = "sk-xxx"</p><p>string_b_embed = get_embedding(string_b, engine="text-similarity-davinci-001")<br />string_a_embed = get_embedding(string_a, engine="text-similarity-davinci-001")<br />cosine_similarity(string_a_embed, string_b_embed)</p>shantanuhttp://www.blogger.com/profile/04386423685935921709noreply@blogger.com0tag:blogger.com,1999:blog-3043709.post-1328657191372603022023-03-01T23:12:00.003-08:002023-03-01T23:20:04.335-08:00Text embedding model by openAI to solve string similarity problem<p>While everyone is discussing about chatGPT by openAI, I decided to try it with an old puzzle that I could not solve for a very long time. The problem is about finding the similar names where almost&nbsp; 50% words are non-English.</p><p>Following strings are semantically similar but there&nbsp; is no algorithm that can tell the truth. For e.g. the fuzzy similarity match will return 82 and 64 percent score while there are several other entries in that range.&nbsp;</p><p>from thefuzz import fuzz</p><p>fuzz.ratio('A.R.B.GARUD ARTS, COMM.&amp; SCIENCE COLLEGE SHENDURNI', 'A.R.B.Garud ARTS, COM, SCI. COLLEGE SHENDURNI') # 82%&nbsp;</p><p>fuzz.ratio('AADARSH HIGH SCHOOL MANDAL',&nbsp; 'AADARSHA VIDYA. MAR.HIGHSCHOOL') # 64%</p><p>Text embeddings by openai (text-embedding-ada-002) returned more than 95% similarity score for both the entries and this is very surprising as well as encouraging. It is very close to human accuracy if not better.</p>shantanuhttp://www.blogger.com/profile/04386423685935921709noreply@blogger.com0tag:blogger.com,1999:blog-3043709.post-57703171400118594672022-11-01T02:18:00.010-07:002022-11-01T02:20:28.645-07:00Check for open ports<p>This code will check if there is any port open and send an alert to the subscribers of SNS topic.<br /><br />import boto3, json<br />ec2 = boto3.client('ec2' , region_name='us-east-1')<br />for security_group in ec2.describe_security_groups()['SecurityGroups']:<br />&nbsp; for i in range(len(security_group['IpPermissions'][0]['IpRanges'])):<br />&nbsp; &nbsp; for k,v in security_group['IpPermissions'][0]['IpRanges'][i].items():<br />&nbsp; &nbsp; &nbsp; &nbsp; print (k, v)<br />&nbsp; &nbsp; &nbsp; &nbsp; if '0.0.0.0' in v:<br />&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; message = {"alert": "open port found "}<br />&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; sns_client = boto3.client("sns", region_name="us-east-1")<br />&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; response = sns_client.publish(TargetArn='arn:aws:sns:us-east-1:102378362623:NotifyMe', <br />&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;Message=json.dumps({'default': json.dumps(message)}), MessageStructure='json') &nbsp; &nbsp; &nbsp;<br /><br />You may need to change the region name and SNS topic ARN address in the code mentioned above.<br />This code can be written as Lambda function and run every day.</p>shantanuhttp://www.blogger.com/profile/04386423685935921709noreply@blogger.com0tag:blogger.com,1999:blog-3043709.post-84344314446315916152022-10-14T02:15:00.002-07:002022-10-14T02:15:17.564-07:00 Finding the number of google search results<p>Here is the python code to return the number of results returned by google search query.</p><p><br /></p><p>import requests</p><p>from bs4 import BeautifulSoup</p><p>from urllib.parse import urlparse</p><p>headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36','referer':'https://www.google.com'}</p><p>mylist = list()</p><p>def get_count(word):</p><p>&nbsp; &nbsp; target_url='https://www.google.com/search?q='+word+'&amp;num=200'</p><p>&nbsp; &nbsp; resp = requests.get(target_url, headers=headers)</p><p>&nbsp; &nbsp; soup=BeautifulSoup(resp.text,'html.parser')</p><p>&nbsp; &nbsp; results = soup.find_all("div",{"class":"jGGQ5e"})</p><p>&nbsp; &nbsp; print (word, len(results))</p><p>&nbsp; &nbsp; for x in range(0,len(myresults)):</p><p>&nbsp; &nbsp; &nbsp; &nbsp; domain=urlparse(myresults[x].find("div",{"class":"yuRUbf"}).find("a").get("href")).netloc</p><p>&nbsp; &nbsp; &nbsp; &nbsp; mylist.append(domain)</p><p>&nbsp; &nbsp; return mylist</p><p><br /></p><p>Now when I call the function, I get to know how many results were returned for a given query:</p><p>get_count('india')</p><p>This will show</p><p>india 98</p><p>['en.wikipedia.org',</p><p>&nbsp;'www.india.gov.in',</p><p>&nbsp;'www.state.gov', ... ... ]</p><p>There are 98 domains returned for a given search term.</p>shantanuhttp://www.blogger.com/profile/04386423685935921709noreply@blogger.com0tag:blogger.com,1999:blog-3043709.post-77785778292872919082022-03-16T21:48:00.001-07:002022-03-16T21:48:06.369-07:00Interactive pandas dataframe<p>iTables is an important utility for pandas dataframe. It will make the df or series interactive.</p><p>https://github.com/mwouts/itables</p><p>Install the package with:</p><p>pip install itables</p><p>Activate the interactive mode:</p><p>from itables import init_notebook_mode<br />init_notebook_mode(all_interactive=True)</p><p>or use itables.show to show just one Series or DataFrame as an interactive table.</p><p>_____</p><p>1) At the moment itables does not have an offline mode. While the table data is embedded in the notebook, the jquery and datatables.net are loaded from a CDN.</p><p>2) When the data in a table is larger than maxBytes, which is equal to 64KB by default, itables will display only a subset of the table. To show the table in full, modify the value of maxBytes either locally:</p><p>show(df, maxBytes=0)</p><p>or globally:</p><p>import itables.options as opt</p><p>opt.maxBytes = 2 ** 20</p>shantanuhttp://www.blogger.com/profile/04386423685935921709noreply@blogger.com0tag:blogger.com,1999:blog-3043709.post-23443595857272577612021-09-18T23:34:00.006-07:002021-09-18T23:39:22.685-07:00 wikipedia search <p>Google search works very well but it is not useful for certain "tasks". For e.g. when I am searching for heritage sites, I google...</p><p>unesco world heritage sites in india</p><p>Usually the first site is the official and the best one to start with...</p><p>https://whc.unesco.org/en/statesparties/in</p><p>The problem with this site is that there is no way to "download" the list in excel. I can visit the relevant wikipedia page and download the table found on that page. But there is a better way to search and download results.</p><p><a href="https://tinyurl.com/36wrc3pw">https://tinyurl.com/36wrc3pw</a></p><p>This is visual query builder for wikipedia. Every entity as been given an ID and we can use SQL query like interface to get the "data" in tabular format. Once I download the csv file, I can open it in libre office and convert the image links to actual images using an excellent add-on that can be installed from...</p><p>https://extensions.libreoffice.org/en/extensions/show/links-to-images</p>shantanuhttp://www.blogger.com/profile/04386423685935921709noreply@blogger.com0tag:blogger.com,1999:blog-3043709.post-40939497198646520162021-07-07T05:57:00.001-07:002021-07-07T05:57:06.834-07:00pandas case study 34<p>moving average for the variable length window</p><br />How do I calculate the moving average where the set of previous rows is not fixed? For e.g. this SQL query will calculate the average for the totalprice column for the last 1 month orders.<br /><br />SELECT avg(totalprice) OVER (<br />&nbsp; &nbsp; PARTITION BY custkey<br />&nbsp; &nbsp; ORDER BY orderdate<br />&nbsp; &nbsp; RANGE BETWEEN interval '1' month PRECEDING AND CURRENT ROW)<br />FROM orders<br /><br />I have this dataframe:<br /><br />from io &nbsp;import StringIO<br />import pandas as pd<br /><br />myst="""cust_1,2020-10-10,100<br />cust_2,2020-10-10,15<br />cust_1,2020-10-15,200<br />cust_1,2020-10-16,240<br />cust_2,2020-12-20,25<br />cust_1,2020-12-25,140<br />cust_2,2021-01-01,5<br />"""<br /><br />u_cols=['custkey', 'orderdate', 'totalprice']<br /><br />myf = StringIO(myst)<br />import pandas as pd<br />orders = pd.read_csv(StringIO(myst), sep=',', names = u_cols)<br />orders['orderdate'] = pd.to_datetime(orders['orderdate']) <br />df=df.sort_values(list(df.columns))<br /><br /><br />Answer:<br /><br />orders['my_average'] = (orders.groupby('custkey')<br />&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; .apply(lambda d: d.rolling('30D', on='orderdate')['totalprice'].mean())<br />&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; .reset_index(level=0, drop=True)<br />&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; .astype(int)<br />&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;)<div><br /></div><div>https://stackoverflow.com/questions/68268531/window-function-for-moving-average</div>shantanuhttp://www.blogger.com/profile/04386423685935921709noreply@blogger.com0tag:blogger.com,1999:blog-3043709.post-26756280875799701842021-07-06T00:26:00.003-07:002021-07-06T00:26:19.077-07:00 Quick test tables in Athena<p>If you quickly ant to test data using athena query, use with syntax as shown below:</p><br />WITH countries(country_code) AS (VALUES 'pol', 'CAN', 'USA')<br />SELECT upper(country_code) AS country_code<br />&nbsp; &nbsp; FROM countries<br />&nbsp; <br />In this example aggregate expression uses "OVER" function.<br />&nbsp; <br />WITH students_results(student_id, result) AS (VALUES<br />&nbsp; &nbsp; ('student_1', 17),<br />&nbsp; &nbsp; ('student_2', 16),<br />&nbsp; &nbsp; ('student_3', 18),<br />&nbsp; &nbsp; ('student_4', 18),<br />&nbsp; &nbsp; ('student_5', 10),<br />&nbsp; &nbsp; ('student_6', 20),<br />&nbsp; &nbsp; ('student_7', 16))<br />SELECT<br />&nbsp; &nbsp; student_id,<br />&nbsp; &nbsp; result,<br />&nbsp; &nbsp; count(*) OVER (<br />&nbsp; &nbsp; &nbsp; &nbsp; ORDER BY result<br />&nbsp;) AS close_better_scores_count<br />FROM students_results<br />shantanuhttp://www.blogger.com/profile/04386423685935921709noreply@blogger.com0tag:blogger.com,1999:blog-3043709.post-77457438319716489272021-05-28T18:31:00.002-07:002021-05-29T01:48:08.617-07:00 Athena and Unicode text<p>Athena supports unicode characters very well. For e.g. if the datafile looks like this...<br /><br />"Root_word";"Word";"Primary";"Type";"Code";"Position";"Rule"<br />"अँटिबायोटिक","अँटिबायोटिक","अँटिबायोटिक","Primary","","",""<br />"अँटिबायोटिक","अँटिबायोटिकअंती","अँटिबायोटिक","Suffix","A","7293","001: 0 अंती ."<br />"अँटिबायोटिक","अँटिबायोटिकअर्थी","अँटिबायोटिक","Suffix","A","7293","002: 0 अर्थी ."<br />"अँटिबायोटिक","अँटिबायोटिकआतून","अँटिबायोटिक","Suffix","A","7293","003: 0 आतून ."<br />"अँटिबायोटिक","अँटिबायोटिकआतूनचा","अँटिबायोटिक","Suffix","A","7293","004: 0 आतूनचा ."<br />"अँटिबायोटिक","अँटिबायोटिकआतूनची","अँटिबायोटिक","Suffix","A","7293","005: 0 आतूनची ."<br />"अँटिबायोटिक","अँटिबायोटिकआतूनचे","अँटिबायोटिक","Suffix","A","7293","006: 0 आतूनचे ."<br />"अँटिबायोटिक","अँटिबायोटिकआतूनच्या","अँटिबायोटिक","Suffix","A","7293","007: 0 आतूनच्या ."<br />"अँटिबायोटिक","अँटिबायोटिकआतूनला","अँटिबायोटिक","Suffix","A","7293","008: 0 आतूनला ."<br /><br />This create table statement is all I need...<br /><br />create external table myptg (<br />root_word varchar(255),<br />derived_word varchar(255),<br />stemmed_word varchar(255),<br />type varchar(255),<br />code varchar(255),<br />position varchar(255),<br />rule varchar(255)<br />)<br />ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'<br />WITH SERDEPROPERTIES (<br />&nbsp; 'separatorChar' = '\;',<br />&nbsp; 'quoteChar' = '\"',<br />&nbsp; 'escapeChar' = '\\'<br />)<br />LOCATION 's3://ptg1/mc/'<br />TBLPROPERTIES ("skip.header.line.count"="1");<br /><br />I can create a supporting table like this...<br /><br />create external table gamabhana (derived_word varchar(255))<br />LOCATION 's3://ptg1/mc2/'<br />TBLPROPERTIES ("skip.header.line.count"="1");</p><p></p><p>A new table can be created using the syntax something like this...<br />&nbsp;<br />create external table anoop (<br />serial_number int,<br />root_word varchar(255),<br />stem1_word varchar(255),<br />stem2_word varchar(255),<br />stem3_word varchar(255),<br />stem4_word varchar(255),<br />stem5_word varchar(255),<br />stem6_word varchar(255),<br />stem7_word varchar(255)<br />)<br />ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'<br />WITH SERDEPROPERTIES (<br />&nbsp; 'field.delim' = ',',<br />&nbsp; 'escape.delim' = '\\',<br />&nbsp; 'line.delim' = '\n'<br />)<br />LOCATION 's3://ptg1/mc3/'<br />TBLPROPERTIES ("skip.header.line.count"="1");<br /><br /></p><p></p><p></p><p></p><p></p><p></p><p><br />And then run a join statement like this...<br /><br />create table gamabhana_match as<br />select a.derived_word, b.root_word, b.stemmed_word, b.type, b.code, b.position, b.rule, c.stem1_word, c.stem2_word, c.stem3_word, c.stem4_word, c.stem5_word, c.stem6_word, c.stem7_word<br />from gamabhana as a left join myptg as b<br />on b.derived_word = a.derived_word <br />left join anoop as c<br />on c.derived_word = a.derived_word<br /><br />It will scan around 2 GB data (in this case) and the cost will be around 1 cent per query. This can also be done in MySQL. But importing data and building indexes is not easy. Unlike Athena, MySQL allows unlimited queries for free!<br /><br />Athena is good for data that is important and accessed rarely.<br /></p>shantanuhttp://www.blogger.com/profile/04386423685935921709noreply@blogger.com0