Shantanu's Blog

disable dynamoDB table

2025-07-01T01:03:00.000-07:00

I have a DynamoDB table called "autocorrect" and I need to stop all traffic to and from this table. I can use a resource-based policy like this...

{
  "Version": "2012-10-17",
  "Id": "PolicyId",
  "Statement": [
    {
      "Sid": "AccessDisabledTemporary",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "dynamodb:*",
      "Resource": "arn:aws:dynamodb:us-east-1:XXXX85053566:table/autocorrect"
    }
  ]
}
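The same policy can also be attached from Python. This is only a minimal sketch: it assumes a recent boto3 that provides put_resource_policy, credentials that are allowed to call it, and a placeholder account ID. The helper names build_deny_policy and apply_policy are made up for this example.

```python
import json


def build_deny_policy(table_arn):
    """Return the deny-all resource-based policy for one table as a dict."""
    return {
        "Version": "2012-10-17",
        "Id": "PolicyId",
        "Statement": [
            {
                "Sid": "AccessDisabledTemporary",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "dynamodb:*",
                "Resource": table_arn,
            }
        ],
    }


def apply_policy(table_arn):
    """Attach the policy to the table (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the policy builder works without boto3

    client = boto3.client("dynamodb")
    client.put_resource_policy(
        ResourceArn=table_arn,
        Policy=json.dumps(build_deny_policy(table_arn)),
    )


if __name__ == "__main__":
    # Placeholder ARN: replace the account ID before calling apply_policy().
    arn = "arn:aws:dynamodb:us-east-1:123456789012:table/autocorrect"
    print(json.dumps(build_deny_policy(arn), indent=2))
```

Running the file only prints the policy; apply_policy() has to be called explicitly.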

Public Appeal to LibreOffice Source Code Contributors

2025-06-04T00:26:00.000-07:00

LibreOffice is among the finest software applications I have ever used, and I deeply appreciate all the contributions that have made it so feature-rich and reliable. However, I have a humble request: please ensure that every contribution is thoroughly tested before it is merged into the codebase.

If anyone is still reading, I would like to share more information for your consideration.

I would like to see an improvement in the testing process for LibreOffice. When a change is made to the source code, it should ideally undergo review and testing by at least two independent testers, followed by approval from a senior developer.

1) Even seemingly minor or trivial changes in the source code can have significant and potentially disruptive effects on the user experience. For example, a shortcut key combination — Ctrl + Shift + C — was assigned to the "Track Changes" function:

https://gerrit.libreoffice.org/c/core/+/65041

This raised concerns: Who approved this change? Who tested and validated it before it was committed?

Many users, including myself, were confused as to why the "Track Changes" feature was suddenly being triggered unexpectedly. The impact of this change was discussed in the following bug reports:

https://bugs.documentfoundation.org/show_bug.cgi?id=130847

https://bugs.documentfoundation.org/show_bug.cgi?id=134151

2) Another example is the addition of the Alt + 5 shortcut key to activate the Sidebar pane:

https://bugs.documentfoundation.org/show_bug.cgi?id=158112

3) PDF files are now exported to the most recently used directory, rather than the directory of the active document. This change in LibreOffice's behavior was unexpected and caused inconvenience.

https://bugs.documentfoundation.org/show_bug.cgi?id=165917

4) Additionally, the removal of the "Add to Dictionary" option from the context menu caused inconvenience to many users:

https://bugs.documentfoundation.org/show_bug.cgi?id=166689

Although this particular issue was resolved quickly, within five days, it still raises a critical question: Who is responsible for testing and verifying such changes before they are merged? Without the "Add to Dictionary" option on right-click, proofreading would have become impossible.

To ensure quality and minimize unintended consequences, I recommend establishing a more robust review and testing protocol for code changes.

Manage AWS resources using command line

2025-06-02T22:31:00.000-07:00

1) Add the access key and secret key of a read-only user:

aws configure

2) I need to install Amazon Q using the instructions found on this page...

https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/command-line-installing-ssh-setup-autocomplete.html

3) I can now start the q service using the command...

./q/bin/q

Use natural language instructions like "list all dynamoDB tables" or "list all S3 buckets"

4) For advanced users: I can create an MCP server and save database credentials such as the username and password, so that Q can query the database and return results.

https://awslabs.github.io/mcp/servers/dynamodb-mcp-server/

Adding a word to dynamoDB table

2025-04-28T22:11:00.000-07:00

If I need to add a word to a dynamoDB table, I use a Lambda function. The function URL looks like this...

https://z2zsnbwispdo5gh2z544bkblbe0amxfb.lambda-url.us-east-1.on.aws/?धर्माद

And the code is as follows:

import boto3
import urllib.parse

def lambda_handler(event, context):
    # The word arrives percent-encoded as the raw query string of the function URL
    request_body = event['rawQueryString']
    print(request_body)

    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('sanskrit')
    # Decode the word and use it as the partition key "pk"
    key = {'pk': urllib.parse.unquote(request_body)}
    table.put_item(Item=key)

    return {
        'statusCode': 200,
        'body': 'success'
    }

It saves the word "धर्माद" to the dynamoDB table "sanskrit".
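The word typically reaches the function percent-encoded, which is why the handler calls urllib.parse.unquote. Here is a small sketch of that round trip. FUNCTION_URL is a placeholder, and build_request_url and decode_query are names invented for this example.

```python
import urllib.parse

# Hypothetical placeholder; use the real function URL from the Lambda console.
FUNCTION_URL = "https://example.lambda-url.us-east-1.on.aws/"


def build_request_url(word):
    """Percent-encode the word and append it as the raw query string."""
    return FUNCTION_URL + "?" + urllib.parse.quote(word)


def decode_query(raw_query_string):
    """Mirror what the handler does with event['rawQueryString']."""
    return urllib.parse.unquote(raw_query_string)


if __name__ == "__main__":
    url = build_request_url("धर्माद")
    print(url)
    raw = url.split("?", 1)[1]
    print(decode_query(raw))
```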

Apply libreoffice styles using a Macro and create PDF

2025-02-19T02:19:00.000-08:00

I have this Dockerfile that is working as expected. I use it to convert a txt file to PDF after formatting it with a style created by a macro.
_____

FROM ubuntu:latest

# Install LibreOffice and scripting dependencies
RUN apt-get update && apt-get install -y libreoffice libreoffice-script-provider-python libreoffice-script-provider-bsh libreoffice-script-provider-js

# Install required dependencies
RUN apt-get update && apt-get install -y wget unzip fonts-dejavu

# Download and install Shobhika font
RUN mkdir -p /usr/share/fonts/truetype/shobhika && wget -O /tmp/Shobhika.zip https://github.com/Sandhi-IITBombay/Shobhika/releases/download/v1.05/Shobhika-1.05.zip && unzip /tmp/Shobhika.zip -d /tmp/shobhika && mv /tmp/shobhika/Shobhika-1.05/*.otf /usr/share/fonts/truetype/shobhika/

# Create necessary directories with proper permissions
RUN mkdir -p /app/.config/libreoffice/4/user/basic/Standard
RUN chmod -R 777 /app/.config

# Set LibreOffice user profile path
ENV UserInstallation=file:///app/.config/libreoffice/4/user

WORKDIR /app
COPY StyleLibrary.oxt /app/
COPY marathi_spell_check.oxt /app/
COPY myfile.txt /app/

RUN unopkg add /app/StyleLibrary.oxt --shared
RUN unopkg add /app/marathi_spell_check.oxt --shared

# Run the LibreOffice macro
CMD soffice --headless --invisible --norestore "macro:///StyleLibrary.Module1.myStyleMacro2(\"/app/myfile.txt\")"
_____

# create an image:
docker build -t shantanuo/mylibre .

# Run the container:
docker run -v .:/app/ --rm shantanuo/mylibre

As you can see I have applied the styles from StyleLibrary to myfile and then created a pdf document successfully.

RAG made easy using LLama

2025-01-14T22:03:00.000-08:00

# use virtual environment to install python and packages

uv init ai-app2

cd ai-app2

pip install llama-index

# download the sample document to index

mkdir data

cd data

wget https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt

cd ..

# start python prompt

python

import os

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()

index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()

response = query_engine.query("What did the author do growing up?")

print(response)

Avoid uploading a file to S3 again

2024-12-03T20:30:00.009-08:00

Let's assume I uploaded a file to S3:

aws s3 cp dictionaries.xcu s3://cf-templates-us-east-1/

I need to upload that file only if it does not already exist in the bucket. In that case I use the --if-none-match parameter as shown below:

aws s3api put-object --bucket cf-templates-us-east-1 --key dictionaries.xcu --body dictionaries.xcu --if-none-match "*"

If the object already exists, the command fails with "An error occurred (PreconditionFailed)" and the existing object is left untouched.

This is useful when I try to upload a large file again.
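The same conditional write can be done from Python. This is a sketch, assuming a recent boto3 whose put_object accepts the IfNoneMatch parameter. upload_if_absent is a name invented for this example, and the client is passed in so the function can be tried without AWS access.

```python
def upload_if_absent(s3_client, bucket, key, body):
    """Upload only when the key does not exist yet.

    Returns True if the object was written, False if it already existed.
    """
    try:
        # IfNoneMatch="*" asks S3 to reject the write when the key exists.
        s3_client.put_object(Bucket=bucket, Key=key, Body=body, IfNoneMatch="*")
        return True
    except Exception as exc:
        # botocore's ClientError carries the error code in exc.response.
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code == "PreconditionFailed":
            return False
        raise


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   with open("dictionaries.xcu", "rb") as fh:
#       upload_if_absent(boto3.client("s3"), "cf-templates-us-east-1",
#                        "dictionaries.xcu", fh)
```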

_____

The following features are available for S3 Express One Zone:

1) In directory buckets, clients can perform conditional delete checks on an object’s last modified time, size, and ETag using the x-amz-if-match-last-modified-time, x-amz-if-match-size, and HTTP if-match headers.

2) Append data to a file:

aws s3api put-object --bucket cf-templates-us-east-1 --key dictionaries.xcu --body dictionaries.xcu --write-offset-bytes <current-size-of-the-object-in-bytes>

Or use python:

s3.put_object(Bucket='amzn-s3-demo-bucket--use2-az2--x-s3', Key='2024-11-05-sdk-test', Body=b'123456789', WriteOffsetBytes=9)

It cannot replace your database or message queues, because only a few thousand appends are possible for each object.

3) You can configure S3 Lifecycle rules for S3 Express One Zone to expire objects on your behalf. For example, you can create an S3 Lifecycle rule that expires all objects smaller than 512 KB after 3 days and another rule that expires all objects in a prefix after 10 days.

Language prediction

2024-11-17T20:27:00.004-08:00

The fastText library by Facebook has a language detection feature.

import fasttext
model = fasttext.load_model("/tmp/lid.176.ftz")
model.predict(" विकिपीडिया पर", k=2)

The above code returns Hindi "hi" correctly. There is also langdetect, a Python port of the language-detection library that was originally hosted on Google Code. The following code returns Marathi "mr" correctly.

from langdetect import detect
detect("आत्मा आणि")

The polyglot library has supported this and other language tools for a very long time.

https://github.com/saffsd/polyglot

awk Case Study - 14

2024-10-28T04:06:00.002-07:00

1) Download stardict files:

git clone https://github.com/freedict/fd-dictionaries.git

2) Download python package to read stardict files:
git clone https://github.com/ilius/pyglossary.git
cd pyglossary/
cp /home/ubuntu/fd-dictionaries/eng-hin/eng-hin.tei .
python3 main.py

# convert eng-hin.tei file to out.txt

Replace each run of HTML tags with a single pipe (|) delimiter and print the first 3 columns:

awk '{
    gsub(/<[^>]*>/, "|");    # every HTML tag becomes a pipe
    gsub(/\|+/, "|");        # collapse runs of pipes into one
    match($0, /([^|]*\|){3}/);
    first_three = substr($0, RSTART, RLENGTH);
    print first_three
}' out.txt > test.csv
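For comparison, the same transformation can be sketched in Python. The function name first_three_fields and the sample line are made up for this example; real lines in out.txt may look different.

```python
import re


def first_three_fields(line):
    """Replace HTML tags with '|', squeeze repeats, keep the first 3 fields."""
    line = re.sub(r"<[^>]*>", "|", line)   # every tag becomes a pipe
    line = re.sub(r"\|+", "|", line)       # runs of pipes collapse to one
    match = re.search(r"(?:[^|]*\|){3}", line)
    return match.group(0) if match else ""


if __name__ == "__main__":
    sample = "hello<b>noun</b><i>नमस्ते</i> extra<br>more"  # hypothetical input
    print(first_three_fields(sample))
```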

awk Case Study - 13

2024-10-04T04:28:00.000-07:00

I have 2 text files. The corpus file is a collection of words and the exclude file has all the suffixes. I need to extract the stemmed words after removing all suffixes.

==> exclude.txt <==
works
ed
s
ing
ings

==> corpus.txt <==
worked
working
works
tested
tests
find
found
workings

awk -f tst.awk exclude.txt corpus.txt | sort

matched test/ed,s
matched work/ed,ing,s,ings
matched working/s
unmatched find
unmatched found

And the awk script will look something like this...

$ cat tst.awk
{ lineLgth = length($0) }
NR == FNR {
    suffixes[$0]
    sfxLgths[lineLgth]
    next
}
{
    base = ""
    for ( sfxLgth in sfxLgths ) {
        baseLgth = lineLgth - sfxLgth
        if ( baseLgth > 0 ) {
            sfx = substr($0,baseLgth+1)
            if ( sfx in suffixes ) {
                base = substr($0,1,baseLgth)
                bases2sfxs[base] = bases2sfxs[base] "," sfx
            }
        }
    }
    if ( base == "" ) {
        print "unmatched", $0
    }
}
END {
    for ( base in bases2sfxs ) {
        sub(/,/,"/",bases2sfxs[base])
        print "matched", base bases2sfxs[base]
    }
}
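The same idea in Python, as a rough sketch rather than a port of the awk script: it tries every split point of a word instead of looping over suffix lengths. The function name stem is made up for this example.

```python
from collections import defaultdict


def stem(suffixes, words):
    """Group words by the base left after removing any known suffix."""
    suffix_set = set(suffixes)
    bases = defaultdict(list)
    unmatched = []
    for word in words:
        found = False
        # Try every split point that leaves a non-empty base.
        for cut in range(1, len(word)):
            base, sfx = word[:cut], word[cut:]
            if sfx in suffix_set:
                bases[base].append(sfx)
                found = True
        if not found:
            unmatched.append(word)
    return dict(bases), unmatched


if __name__ == "__main__":
    exclude = ["works", "ed", "s", "ing", "ings"]
    corpus = ["worked", "working", "works", "tested", "tests",
              "find", "found", "workings"]
    matched, unmatched = stem(exclude, corpus)
    for base, sfxs in matched.items():
        print("matched", base + "/" + ",".join(sfxs))
    for word in unmatched:
        print("unmatched", word)
```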

Firefox and Libreoffice in your browser

2024-09-28T22:16:00.000-07:00

Kasm VNC is a modern open source VNC server.

Quickly connect to your Linux server's desktop from any web browser.
No client software install required.

1) Firefox using VNC

docker run -d \
--name=firefox \
-e PUID=1000 \
-e PGID=1000 \
-e TZ=Etc/UTC \
-p 3000:3000 \
-p 3001:3001 \
-v /path/to/config2:/config \
--shm-size="1gb" \
--restart unless-stopped \
lscr.io/linuxserver/firefox:latest

2) Libreoffice using VNC

docker run -d \
--name=libreoffice \
--security-opt seccomp=unconfined `#optional` \
-e PUID=1000 \
-e PGID=1000 \
-e TZ=Etc/UTC \
-p 3000:3000 \
-p 3001:3001 \
-v /path/to/config:/config \
--restart unless-stopped \
lscr.io/linuxserver/libreoffice:latest

export to pdf using linux command

2024-09-24T22:41:00.000-07:00

You can generate a PDF file from a LibreOffice Writer "odt" file.

The File - Export as PDF option is available only if you are using the GUI.
Here is how to convert to PDF using the command line.

# vi Dockerfile
FROM ubuntu:latest

RUN apt-get update && \
apt-get install -y libreoffice

WORKDIR /workspace

ENTRYPOINT ["libreoffice", "--headless", "--convert-to", "pdf"]

# docker build -t shantanuo/libreoffice-converter .

run the docker command to convert a file to pdf
# docker run --rm -v .:/workspace shantanuo/libreoffice-converter /workspace/pm_in_paris.odt --outdir /workspace

_____

Use this dockerfile if you need to apply a template before creating a PDF file.

FROM ubuntu:latest

RUN apt-get update && apt-get install -y libreoffice python3 python3-venv

RUN python3 -m venv /workspace/venv

RUN /workspace/venv/bin/pip install --upgrade pip

RUN /workspace/venv/bin/pip install unotools

COPY * /workspace/

WORKDIR /workspace

# Start LibreOffice in headless mode in the background and run the Python script after it is started

ENTRYPOINT soffice --headless --accept="pipe,name=libreoffice;urp;StarOffice.ComponentContext" & \
    sleep 5 && \
    python3 /workspace/updated3.py /workspace/ra.txt /workspace/prajakta.ott

I can create an image and it converts the text file to PDF correctly.

docker build -t shantanuo/libreoffice-converter .

The raw text file and the template are available in the current directory. The generated PDF is also available in the same place after running this command:

docker run -v .:/workspace/ --rm shantanuo/libreoffice-converter

The python code to apply the template and create pdf is available here...

https://gist.github.com/shantanuo/f635bbdb764d1fafa8587203d7f8823a

awk Case Study - 12

2024-09-08T20:48:00.000-07:00

Select the first column from the csv file and remove "www". In SQL the command will look something like this...

select replace(column1, 'www', '') from tbl

# cat logs.csv
Origin,Status,Title,ContentType,IP,Country,City,PhoneCode
https://gnu.org,OK,The GNU Operating System and the Free Software Movement,text/html,209.51.188.116,United States,Boston,+1
https://0t1.me,OK,ZeroToOne - Home,text/html,104.21.84.218,Canada,Toronto,+1

# cat myak.txt
#!/usr/bin/awk -f
BEGIN {
    FS = ","
    OFS = ""
}

function normalize_origin(origin) {
    sub(/www\./, "", origin)  # remove "www." from the origin (the dot is escaped so it matches literally)
    return origin
}

{
    # Ignore the header line.
    if (NR == 1) {
        next
    }

    origin = normalize_origin($1)
    print origin
}

# awk -f myak.txt < logs.csv
https://gnu.org
https://0t1.me

https://0t1.me/blog/2024/09/01/practical-awk/

Sanskrit-English translation corpus

2024-08-27T20:47:00.000-07:00

Itihāsa is a Sanskrit-English translation corpus containing 93,000 Sanskrit shlokas and their English translations extracted from M. N. Dutt's seminal works on The Rāmāyana and The Mahābhārata.

https://github.com/rahular/itihasa

Itihāsa can be used directly from Hugging Face Datasets:

from datasets import load_dataset
dataset = load_dataset("rahular/itihasa")
dataset['train'][0]

{'translation': {'en': 'The ascetic Vālmīki asked Nārada, the best of sages and foremost of those conversant with words, ever engaged in austerities and Vedic studies.', 'sn': 'ॐ तपः स्वाध्यायनिरतं तपस्वी वाग्विदां वरम्। नारदं परिपप्रच्छ वाल्मीकिर्मुनिपुङ्गवम्॥'}}

Using playwright on ARM processor

2024-08-24T23:13:00.000-07:00

You can use Playwright on an ARM processor using these steps:

1) Use docker to start a container:
docker run -it --rm --ipc=host mcr.microsoft.com/playwright:v1.46.1-jammy /bin/bash

2) Once inside the container, type these commands:

apt-get install python3-pip
pip install playwright
playwright install

3) Create or copy a test file:
vi app/app.py

from playwright.sync_api import sync_playwright

def handler(event, context):
    with sync_playwright() as p:
        browser = p.chromium.launch(args=["--disable-gpu", "--single-process", "--headless=new"], headless=True)
        page = browser.new_page()
        page.goto("https://stackoverflow.com/questions/9780717/bash-pip-command-not-found")
        print(page.title())
        browser.close()

# Call the handler when the file is run directly; the arguments are unused here
if __name__ == "__main__":
    handler(None, None)

4) Run the file:
python3 app/app.py

If you get the title of the page, i.e. "python - bash: pip: command not found - Stack Overflow" as output then everything is working ok.

_____

Here is another example:

import asyncio

from playwright.async_api import async_playwright # 1.44.0

async def main():
    term = "\"टंकलेखन\""
    url = f"https://www.google.com/search?q={term}"

    async with async_playwright() as pw:
        browser = await pw.chromium.launch(args=["--disable-gpu", "--single-process", "--headless=new"], headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="domcontentloaded")

        # Find the element with ID "result-stats" and get its text
        result_stats = await page.locator('#result-stats').text_content()

        if result_stats:
            print("Result stats:", result_stats)
        else:
            print("Element 'result-stats' not found.")

        await browser.close()

if __name__ == "__main__":
    asyncio.run(main())

Disable dynamoDB table access

2024-06-27T01:17:00.000-07:00

I can disable all access to a dynamoDB table using a resource-based policy. Here is an example:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Principal": {
        "AWS": "*"
      },
      "Action": "dynamodb:*",
      "Resource": "arn:aws:dynamodb:us-east-1:XXX885053566:table/sandhiDupe"
    }
  ]
}

There are many other advantages of managing access at resource level.

awk Case Study - 11

2024-06-06T02:16:00.000-07:00

Let's assume there are 2 tables and you need to join them on the first column. The SQL query looks something like this...

select a.id, a.long, a.lat, b.location from tbl_a as a inner join tbl_b as b on a.id = b.id

If you have .csv files instead of tables, use awk.

a_vt
9998.69,-80.87,-8.7987987988,279.13,-8.7987987988
9998.34,-81.05,-8.43843843844,278.95,-8.43843843844
9999.77,-83.03,-7.71771771772,276.97,-7.71771771772
9999.48,-83.57,-7.23723723724,276.43,-7.23723723724
9999.08,-83.99,-7.2972972973,276.01,-7.2972972973
9998.75,-81.71,-6.996996997,278.29,-6.996996997
9998.75,-81.65,-6.996996997,278.35,-6.996996997
9997.89,-83.99,-6.21621621622,276.01,-6.21621621622
9997.77,-77.27,-16.1261261261,282.73,-16.1261261261
9997.54,-82.43,-4.29429429429,277.57,-4.29429429429

b_vm
9998.69,110.0TN,110.0TN,-75.6551,-14.9496,284.345,-14.9496
9998.34,100.0TN,100.0TN,-75.62949999999998,-14.9573,284.37,-14.9573
22850,39.78686TN,39.78686TN,-75.6259,-14.9867,284.374,-14.9867
22901.9,9.90099TN,9.90099TN,-75.649,-14.9636,284.351,-14.9636
27742.2,160.0TN,160.0TN,-75.5999,-14.9922,284.4,-14.9922
22901.9,110.0TN,110.0TN,-75.6648,-14.9526,284.335,-14.9526
27742.2,90.0TN,90.0TN,-75.60129999999998,-14.9973,284.399,-14.9973
27685.3,90.0TN,90.0TN,-75.6024,-14.9626,284.398,-14.9626
27742.2,80.0TN,80.0TN,-75.6014,-15.0006,284.399,-15.0006
22901.9,80.0TN,80.0TN,-75.6597,-14.9626,284.34,-14.9626

$ awk 'NR==FNR { a[$1]; next }( ($1 in a) ) { print }' FS="," b_vm a_vt
9998.69,-80.87,-8.7987987988,279.13,-8.7987987988
9998.34,-81.05,-8.43843843844,278.95,-8.43843843844

Expected Output:

9998.69,-80.87,-8.7987987988,279.13,-8.7987987988,**110.0TN**
9998.34,-81.05,-8.43843843844,278.95,-8.43843843844,**100.0TN**

i.e. second column from b_vm should be included in the output.
_____

Ans:

awk -F, 'NR==FNR { a[$1] = $2; next } $1 in a {print $0 "," a[$1]}' b_vm a_vt

9998.69,-80.87,-8.7987987988,279.13,-8.7987987988,110.0TN
9998.34,-81.05,-8.43843843844,278.95,-8.43843843844,100.0TN

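Here a[$1] = $2 stores the second column of b_vm in array a, indexed by the first column.
While reading the second file (a_vt), every line whose first column is an index in a is printed with the stored value a[$1] appended.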

https://stackoverflow.com/questions/78551072/adding-column-after-comparing-two-files
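The same lookup-then-append join can be sketched in Python with the csv module. The function name join_on_first_column is made up for this example, and only a few of the rows shown above are used as sample data.

```python
import csv
import io


def join_on_first_column(left_rows, right_rows):
    """Inner-join two lists of CSV rows on column 1, appending right column 2."""
    lookup = {}
    for row in right_rows:
        lookup[row[0]] = row[1]   # a later duplicate key overwrites, like a[$1] = $2
    return [row + [lookup[row[0]]] for row in left_rows if row[0] in lookup]


if __name__ == "__main__":
    a_vt = io.StringIO(
        "9998.69,-80.87,-8.7987987988,279.13,-8.7987987988\n"
        "9999.77,-83.03,-7.71771771772,276.97,-7.71771771772\n"
    )
    b_vm = io.StringIO(
        "9998.69,110.0TN,110.0TN,-75.6551,-14.9496,284.345,-14.9496\n"
        "22850,39.78686TN,39.78686TN,-75.6259,-14.9867,284.374,-14.9867\n"
    )
    joined = join_on_first_column(list(csv.reader(a_vt)), list(csv.reader(b_vm)))
    for row in joined:
        print(",".join(row))
```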

Make Ubuntu great again!

2024-05-18T00:05:00.000-07:00

When you click above or below the slider on a scrollbar, the view no longer scrolls up or down by a "page" as it did for many years; it now jumps to wherever you click.

To change this, edit (or create) the file:

~/.config/gtk-3.0/settings.ini

And add the following:

[Settings]
gtk-primary-button-warps-slider = false

Do not forget to restart.

Pandas as command prompt

2024-05-05T01:09:00.000-07:00

You can use pandas at command prompt like this...

curl -s https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip | \
gunzip | \
python3 -c 'import sys, pandas as pd
pd.read_csv(sys.stdin).melt("Date").to_csv(sys.stdout, index=False)'

curl -s https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip | \
gunzip | \
python3 -c 'import sys, pandas as pd
pd.read_csv(sys.stdin).iloc[:, :-1].melt("Date")\
.to_csv(sys.stdout, index=False)'

Remote Desktop to Ubuntu Server

2024-05-01T22:28:00.000-07:00

Ubuntu Desktop requires downloading about 500 MB of packages and an additional 2 GB of disk space. If that's too much, you can install a more lightweight desktop environment called Xfce. It's just 45 MB of packages that use an extra 175 MB of space. The command below installs the full Ubuntu desktop along with xrdp; for the lighter option, the xfce4 package can be used in place of ubuntu-desktop:

sudo apt-get install -y tightvncserver xrdp ubuntu-desktop m17n-db ibus-m17n

_____

1) You may need to change the ubuntu user's password if you are using an EC2 instance.

echo 'ubuntu:india' | sudo chpasswd

2) The second step is to change PasswordAuthentication to "yes" in the file /etc/ssh/sshd_config

3) avoid security error

sudo adduser xrdp ssl-cert

4) disable screen lock and suspend

gsettings set org.gnome.desktop.session idle-delay 0

systemctl mask suspend.target

Partitioning and bucketing S3 data

2024-04-29T01:10:00.000-07:00

Let's assume we need to optimize this query. The data is stored in S3 (around a few TB)

SELECT * FROM "bucketing_blog"."noaa_remote_original"
WHERE
report_type = 'CRN05'
AND ( station = '99999904237'
OR station = '99999953132'
OR station = '99999903061'
OR station = '99999963856'
OR station = '99999994644'
);

There are around 14325 unique stations, and the report_type column contains around 13 types.

1) The first thing that can be tried is to create a table with partitioning (using report_type)

CREATE TABLE "bucketing_blog"."athena_non_bucketed"
WITH (
external_location = 's3://<your-s3-location>/athena-non-bucketed/',
partitioned_by = ARRAY['report_type'],
format = 'PARQUET',
write_compression = 'SNAPPY'
)
AS
SELECT * FROM "bucketing_blog"."noaa_remote_original";

2) If you need faster and cheaper queries, add bucketing:

CREATE TABLE "bucketing_blog"."athena_bucketed"
WITH (
external_location = 's3://<your-s3-location>/athena-bucketed/',
partitioned_by = ARRAY['report_type'],
bucketed_by = ARRAY['station'],
bucket_count = 16,
format = 'PARQUET',
write_compression = 'SNAPPY'
)
AS
SELECT * FROM "bucketing_blog"."noaa_remote_original"

I created 16 buckets because that is the maximum number of stations that may appear in the SQL query.
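To get a feel for why bucketing helps, here is a toy sketch. It assumes rows are assigned to a bucket by hashing the station value, so a filter on a handful of stations only has to read the buckets those stations hash to. zlib.crc32 is a stand-in chosen for this example and is not the hash Athena uses; bucket_for is also a made-up name.

```python
import zlib

BUCKET_COUNT = 16


def bucket_for(station, bucket_count=BUCKET_COUNT):
    """Map a station id to a bucket number (stand-in hash, not Athena's)."""
    return zlib.crc32(station.encode("utf-8")) % bucket_count


# The five stations from the query above.
stations = ["99999904237", "99999953132", "99999903061",
            "99999963856", "99999994644"]
buckets_to_scan = sorted({bucket_for(s) for s in stations})
print(f"scan {len(buckets_to_scan)} of {BUCKET_COUNT} buckets:", buckets_to_scan)
```

At most 5 of the 16 buckets need to be read per partition in this toy model; the rest can be skipped.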

https://aws.amazon.com/blogs/big-data/optimize-data-layout-by-bucketing-with-amazon-athena-and-aws-glue-to-accelerate-downstream-queries/

awk Case Study - 10

2023-11-18T00:08:00.000-08:00

Formats its input into lines that are at most 60 characters long

# fmt - format
# input: text
# output: text formatted into lines of <= 60 characters

awk '/./ { for (i = 1; i <= NF; i++) addword($i) }
/^$/ { printline(); print "" }
END { printline() }

function addword(w) {
    if (length(line) + length(w) > 60)
        printline()
    line = line " " w
}
function printline() {
    if (length(line) > 0) {
        print substr(line, 2)
        line = ""
    }
}' long.txt
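Here is a Python sketch that follows the same idea: collect words into a line and flush it before it would exceed 60 characters. The function name fmt is made up for this example, and unlike the awk version it also treats whitespace-only lines as paragraph breaks.

```python
def fmt(text, width=60):
    """Re-flow text into lines of at most `width` characters.

    Blank lines are kept as paragraph breaks.
    """
    out = []
    line = ""

    def flush():
        nonlocal line
        if line:
            out.append(line)
            line = ""

    for raw in text.splitlines():
        words = raw.split()
        if not words:            # blank line ends the paragraph
            flush()
            out.append("")
            continue
        for word in words:
            # +1 accounts for the space that would join the word to the line
            if line and len(line) + 1 + len(word) > width:
                flush()
            line = word if not line else line + " " + word
    flush()
    return out


if __name__ == "__main__":
    sample = ("alpha " * 30) + "\n\n" + "beta gamma"
    print("\n".join(fmt(sample)))
```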

awk Case Study - 9

2023-11-17T23:54:00.000-08:00

Cliche generator, which creates new cliches out of old ones. The input is a set of sentences like

# cat cliche.txt

A rolling stone:gathers no moss.
History:repeats itself.
He who lives by the sword:shall die by the sword.
A jack of all trades:is master of none.
Nature:abhors a vacuum.
Every man:has a price.
All's well that:ends well.

where a colon separates subject from predicate. Our cliche program combines a random subject with a random predicate; with luck it produces the occasional mildly amusing aphorism:

A rolling stone repeats itself.
History abhors a vacuum.
Nature repeats itself.
All's well that gathers no moss.
He who lives by the sword has a price.

# cliche - generate an endless stream of cliches
# input: lines of form subject:predicate
# output: lines of random subject and random predicate

awk 'BEGIN { FS = ":" }
    { x[NR] = $1; y[NR] = $2 }
END { for (;;) print x[randint(NR)], y[randint(NR)] }

function randint(n) {
    return int(n * rand()) + 1
}' cliche.txt

Don't forget that this program is intentionally an infinite loop.
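A Python sketch of the same generator is below. Unlike the awk program it stops after a fixed number of sentences, and it takes an optional seed so that a run can be repeated. The function name cliches is made up for this example.

```python
import random


def cliches(lines, count, seed=None):
    """Return `count` sentences pairing a random subject with a random predicate."""
    rng = random.Random(seed)
    subjects, predicates = [], []
    for line in lines:
        subject, predicate = line.split(":", 1)
        subjects.append(subject)
        predicates.append(predicate)
    return [f"{rng.choice(subjects)} {rng.choice(predicates)}" for _ in range(count)]


if __name__ == "__main__":
    data = [
        "A rolling stone:gathers no moss.",
        "History:repeats itself.",
        "Nature:abhors a vacuum.",
        "Every man:has a price.",
    ]
    for sentence in cliches(data, 5):
        print(sentence)
```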

awk Case Study - 8

2023-11-17T23:39:00.000-08:00

Isolate the words and aggregate the count for each word in an associative array. A word is a field without the punctuation marks like ? or ,

# wordfreq - print number of occurrences of each word

# input: text

# output: number-word pairs sorted by number

awk '{
gsub(/[.,:;!?(){}]/, "")
for (i = 1; i <= NF; i++)
count[$i]++
}
END {
for (w in count)
print count[w], w | "sort -rn"
}' capitals
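The same count can be sketched in Python with collections.Counter. The function name word_freq is made up for this example; it strips a few common punctuation marks and then splits on whitespace.

```python
import re
from collections import Counter


def word_freq(text):
    """Count words after stripping the punctuation marks . , : ; ! ? ( ) { }"""
    cleaned = re.sub(r"[.,:;!?(){}]", "", text)
    return Counter(cleaned.split())


if __name__ == "__main__":
    counts = word_freq("the cat, the dog. The end?")
    # most_common() sorts by count, highest first, like sort -rn
    for word, n in counts.most_common():
        print(n, word)
```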

awk Case Study - 7

2023-11-17T23:29:00.000-08:00

Print the names of the countries in Asia along with their populations and capitals:

# cat capitals

USSR Moscow
Canada Ottawa
China Beijing
USA Washington
Brazil Brasilia
India New Delhi
Mexico Mexico
France Paris
Japan Tokyo
Germany Bonn
England London

# cat countries
USSR 8649 275 Asia
Canada 3852 25 North America
China 3705 1032 Asia
USA 3615 237 North America
Brazil 3286 134 South America
India 1267 746 Asia
Mexico 762 78 North America
France 211 55 Europe
Japan 144 120 Asia
Germany 96 61 Europe
England 94 56 Europe

(make sure that the files are tab separated)

# awk 'BEGIN { FS = "\t"}
FILENAME == "capitals" {
cap[$1] = $2
}
FILENAME == "countries" && $4 == "Asia" {
print $1, $3, cap[$1]
}' capitals countries

USSR 275 Moscow
China 1032 Beijing
India 746 New Delhi
Japan 120 Tokyo

It would certainly be easier if we could just say something like

continent ~ /Asia/ { print $country, $population, $capital}

and have a program figure out where the fields are and how to put them together. This is how we would phrase this query in qawk