Helyx.org

Generate S3 presigned URLs with Boto3

Alexis Kinsella — Fri, 01 Jan 2021 15:01:21 GMT

AWS SDKs are available for many languages. Python, however, is a popular choice for writing serverless code and processing data, so we will use it to show how to create a presigned URL for an S3 object.

Whatever your reason for restricting access to an object stored in Amazon S3, presigned URLs let you share that object with a restricted audience without having to configure a set of permissions associated with an access key.

You don't want to create a specific user and an associated access key each time you want to make a restricted resource accessible. It is not manageable.

Also, you don't want to make the resource public, as you want to keep control over who accesses it.

For this reason, presigned URLs are a good option. You can automate their creation in your code, and you can make them expire whenever you want.

The downside is that you cannot revoke an individual presigned URL once it has been issued. To cut off access before it expires, you have to remove the resource from its location (or invalidate the credentials that signed the URL).

Also, you cannot prevent a URL from being shared, so the use of presigned URLs must be compatible with your needs. For example, if you need to generate links to some content for a limited period of time only, they are a great fit.

Create a presigned URL with Boto3

Boto3, the AWS SDK for Python, lets you interact with the Amazon S3 service and generate presigned URLs. The example below generates a presigned URL for an object (here, a JSON file) stored in a bucket under some prefix, with a validity period of 1 day:

import boto3
from botocore.client import Config

# Get the service client with sigv4 configured
s3 = boto3.client('s3', config=Config(signature_version='s3v4'))

# Generate the URL to get 'some-path/file-to-access.json' from 'my-bucket'
# URL expires in 86400 seconds (one day)
url = s3.generate_presigned_url(
    ClientMethod='get_object',
    Params={
        'Bucket': 'my-bucket',
        'Key': 'some-path/file-to-access.json'
    },
    ExpiresIn=86400
)

print(url)

Generate a presigned URL with Boto3

Assuming you save that script into a Python file named generate_presigned_url.py, you can run it with the following command:

AWS_PROFILE=my_profile python generate_presigned_url.py

Here, we are using an already configured AWS profile called my_profile.

The output will look something like this:

https://my-bucket.s3.amazonaws.com/some-path/file-to-access.json?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAYSWZVJ32FGBAXBNB%2F20210101%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20210101T134645Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=92e1ff94b2eb8541d9f4b1ea058c02d69717383fa39daec91b5bb31e2f90f4d4

By reviewing the result, we can see that the URL points to the configured bucket and key, and that the following query string parameters were generated:

The algorithm used: X-Amz-Algorithm=AWS4-HMAC-SHA256
The credential scope (access key ID, date, region, service and request type): X-Amz-Credential=AKIAYSWZVJ32FGBAXBNB%2F20210101%2Feu-west-1%2Fs3%2Faws4_request
The generation date in ISO8601 format: X-Amz-Date=20210101T134645Z
The validity period in seconds: X-Amz-Expires=86400
The headers used for the signature: X-Amz-SignedHeaders=host
And finally the signature, which makes it possible to check that the URL has not been modified: X-Amz-Signature=92e1ff94b2eb8541d9f4b1ea058c02d69717383fa39daec91b5bb31e2f90f4d4
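
To make the breakdown above concrete, here is a minimal sketch that decodes those query string parameters with the Python standard library only. The function name describe_presigned_url is hypothetical (it is not part of Boto3), and the sketch only reads the parameters; it does not verify the signature.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit


def describe_presigned_url(url: str) -> dict:
    """Extract the SigV4 query parameters from a presigned URL (hypothetical helper)."""
    # parse_qs returns a list of values per key and decodes %2F back into '/'
    query = parse_qs(urlsplit(url).query)
    access_key, _date, region, service, _request_type = query["X-Amz-Credential"][0].split("/")
    signed_at = datetime.strptime(
        query["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ"
    ).replace(tzinfo=timezone.utc)
    expires_in = int(query["X-Amz-Expires"][0])
    return {
        "algorithm": query["X-Amz-Algorithm"][0],
        "access_key_id": access_key,
        "region": region,
        "service": service,
        "signed_at": signed_at,
        "expires_at": signed_at + timedelta(seconds=expires_in),
        "signed_headers": query["X-Amz-SignedHeaders"][0].split(";"),
    }


# The sample URL shown earlier in this article
url = (
    "https://my-bucket.s3.amazonaws.com/some-path/file-to-access.json"
    "?X-Amz-Algorithm=AWS4-HMAC-SHA256"
    "&X-Amz-Credential=AKIAYSWZVJ32FGBAXBNB%2F20210101%2Feu-west-1%2Fs3%2Faws4_request"
    "&X-Amz-Date=20210101T134645Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host"
    "&X-Amz-Signature=92e1ff94b2eb8541d9f4b1ea058c02d69717383fa39daec91b5bb31e2f90f4d4"
)
info = describe_presigned_url(url)
print(info["region"], info["expires_at"].isoformat())
```

Such a helper can be handy to check, before sending a link to someone, when it will actually stop working.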

To improve usability of the script, it might be a good idea to parse the command line arguments, and use them to configure the Boto3 call to generate the signed URL:

import boto3
from botocore.client import Config
import argparse

parser = argparse.ArgumentParser("generate_signed_url")
parser.add_argument("bucket", help="S3 Bucket", type=str)
parser.add_argument("key", help="S3 key", type=str)
parser.add_argument("expires_in", help="Expire in", type=int)
args = parser.parse_args()


# Get the service client with sigv4 configured
s3 = boto3.client('s3', config=Config(signature_version='s3v4'))

# Generate the URL to get the given key from the given bucket
# URL expires after the number of seconds passed as expires_in
url = s3.generate_presigned_url(
    ClientMethod='get_object',
    Params={
        'Bucket': args.bucket,
        'Key': args.key
    },
    ExpiresIn=args.expires_in
)

print(url)

Generate a presigned URL with Boto3 with command line arguments

The previous script takes additional arguments on the command line:

AWS_PROFILE=my_profile python generate_signed_url.py my_bucket my_prefix/my_object.json 3600
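
As a possible refinement, the expires_in argument can be validated before any call is made. The sketch below assumes the 7-day (604800 seconds) ceiling that applies to Signature Version 4 URLs signed by an IAM user. The names MAX_EXPIRES_IN, expiry_seconds and build_parser are hypothetical helpers, not part of Boto3.

```python
import argparse

# Assumption: 7 days, the ceiling for SigV4 URLs signed by an IAM user
MAX_EXPIRES_IN = 604800


def expiry_seconds(value: str) -> int:
    """argparse type: accept an integer between 1 and MAX_EXPIRES_IN."""
    seconds = int(value)  # a non-numeric value raises ValueError, which argparse reports
    if not 1 <= seconds <= MAX_EXPIRES_IN:
        raise argparse.ArgumentTypeError(
            f"expires_in must be between 1 and {MAX_EXPIRES_IN} seconds, got {seconds}"
        )
    return seconds


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser("generate_signed_url")
    parser.add_argument("bucket", help="S3 bucket", type=str)
    parser.add_argument("key", help="S3 key", type=str)
    parser.add_argument("expires_in", help="Validity period in seconds", type=expiry_seconds)
    return parser


# Parsing an explicit argument list instead of sys.argv, for demonstration
args = build_parser().parse_args(["my_bucket", "my_prefix/my_object.json", "3600"])
print(args.expires_in)
```

With this in place, an out-of-range value makes the script exit with a usage error instead of producing a URL that S3 would reject later.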

S3v4 signatures

The previous examples were configured to use S3v4 signatures to generate presigned URLs. Calling the generate_presigned_url function without configuring the Boto3 client to use s3v4 signatures results in a different signature format:

https://s3.eu-west-1.amazonaws.com/my_bucket/my_prefix/my_object.json?AWSAccessKeyId=AKIAYSWZVJ32FGBAXBNB&Signature=LYlMYi2LMr4dQK4ivSGVUiF5Yqo%3D&Expires=1609513255

This detail might not seem important. However, if you try to provide access to a file encrypted with an AWS KMS managed key, the presigned URL will not work unless AWS Signature Version 4 is configured on the Boto3 client. Using another signature format results in the following error:


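<Error>
    <Code>InvalidArgument</Code>
    <Message>Requests specifying Server Side Encryption with AWS KMS managed keys require AWS Signature Version 4.</Message>
    <ArgumentName>Authorization</ArgumentName>
    <ArgumentValue>null</ArgumentValue>
    <RequestId>80149C77623B9D45</RequestId>
    <HostId>HZBvZonRGHTPRI51YQYQZIuRqsclhb1RddrM2F7jbvKMVTUBbfhEq9N9HJhj4sRngjTRlbrxYyi=</HostId>
</Error>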
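
The two URL formats shown in this section can be told apart from their query string alone. Here is a minimal sketch of such a check, using only the standard library. The function name signature_version and the returned labels are hypothetical, and the check is a heuristic based on the parameter names visible in the two sample URLs.

```python
from urllib.parse import parse_qs, urlsplit


def signature_version(url: str) -> str:
    """Guess which signing scheme produced a presigned S3 URL (heuristic)."""
    params = parse_qs(urlsplit(url).query)
    if params.get("X-Amz-Algorithm") == ["AWS4-HMAC-SHA256"]:
        return "SigV4"
    if "AWSAccessKeyId" in params and "Signature" in params:
        return "SigV2"
    return "unknown"


v4_url = (
    "https://my-bucket.s3.amazonaws.com/some-path/file-to-access.json"
    "?X-Amz-Algorithm=AWS4-HMAC-SHA256"
    "&X-Amz-Credential=AKIAYSWZVJ32FGBAXBNB%2F20210101%2Feu-west-1%2Fs3%2Faws4_request"
    "&X-Amz-Date=20210101T134645Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host"
    "&X-Amz-Signature=92e1ff94b2eb8541d9f4b1ea058c02d69717383fa39daec91b5bb31e2f90f4d4"
)
v2_url = (
    "https://s3.eu-west-1.amazonaws.com/my_bucket/my_prefix/my_object.json"
    "?AWSAccessKeyId=AKIAYSWZVJ32FGBAXBNB&Signature=LYlMYi2LMr4dQK4ivSGVUiF5Yqo%3D"
    "&Expires=1609513255"
)
print(signature_version(v4_url), signature_version(v2_url))
```

A check like this could be used as a safety net in a script that must only hand out Signature Version 4 URLs, for example when the objects are encrypted with KMS.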

Temporary credentials

Presigned URLs inherit the permissions of the IAM principal that signs them. If that principal uses temporary credentials, for example an STS session of 1 hour, then even if you set the expiry to 1 day, access through the presigned URL will be rejected as soon as the session becomes invalid. In this example, the presigned URL would stop working after 1 hour.
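
The rule described above boils down to taking the earlier of two dates. A minimal sketch of that arithmetic follows; the function name effective_expiry is hypothetical, and it assumes you know when the signing credentials expire.

```python
from datetime import datetime, timedelta, timezone


def effective_expiry(signed_at, expires_in, credentials_expire_at=None):
    """Return the moment a presigned URL really stops working.

    credentials_expire_at is None for long-lived credentials (IAM user keys).
    """
    requested = signed_at + timedelta(seconds=expires_in)
    if credentials_expire_at is None:
        return requested
    # The URL dies with the session that signed it, whichever comes first
    return min(requested, credentials_expire_at)


signed_at = datetime(2021, 1, 1, 12, 0, tzinfo=timezone.utc)
session_end = signed_at + timedelta(hours=1)  # 1-hour STS session
print(effective_expiry(signed_at, 86400, session_end))
```

Here the URL was requested for 1 day, but the computed expiry is one hour after signing, which matches the scenario described above.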

Presigned URLs limitations

The maximum validity period depends on the credentials used to create the presigned URL:

IAM instance profile: valid up to 6 hours
AWS Security Token Service: valid up to 36 hours
IAM user: valid up to 7 days with AWS Signature Version 4

Presigned URLs for file upload

Presigned URLs can be used in many situations to access resources already stored in S3. However, you can also use presigned URLs to upload objects to S3.

It is useful when you want your user/customer to be able to upload a specific object to your S3 storage without providing AWS security credentials.

As presigned URLs inherit the permissions of the IAM principal that signs them, you should design the associated permissions carefully to avoid security issues. It is possible, for example, to limit use to specific network paths (with the aws:SourceIp, aws:SourceVpc and aws:SourceVpce conditions in policy definitions).
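
As a rough sketch, generating an upload URL uses the same generate_presigned_url call as before, with put_object as the client method. The wrapper presign_upload below is a hypothetical helper; it takes the S3 client as a parameter so that the Boto3 wiring stays outside the function. The bucket and key names are placeholders.

```python
def presign_upload(s3_client, bucket: str, key: str, expires_in: int = 3600) -> str:
    """Return a URL allowing an HTTP PUT of `key` into `bucket` (hypothetical helper)."""
    return s3_client.generate_presigned_url(
        ClientMethod="put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )


# Typical wiring (requires boto3 and configured credentials):
#   import boto3
#   from botocore.client import Config
#   s3 = boto3.client("s3", config=Config(signature_version="s3v4"))
#   url = presign_upload(s3, "my-bucket", "uploads/report.json")
#
# The recipient can then upload the file with a plain HTTP PUT, for example:
#   curl --upload-file report.json "<url>"
```

The signing principal needs permission to put the object, since the upload is performed with its identity.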

Additional resources

You can refer to more detailed explanations in the AWS documentation to share objects at this page: https://docs.aws.amazon.com/AmazonS3/latest/dev/ShareObjectPreSignedURL.html, and to upload objects here: https://docs.aws.amazon.com/AmazonS3/latest/dev/PresignedUrlUploadObject.html.

Introduction to AWS CloudShell

Alexis Kinsella — Sat, 26 Dec 2020 21:22:32 GMT

During the re:Invent 2020 Developer Keynote, Dr. Werner Vogels introduced a handy new service named AWS CloudShell.

AWS CloudShell is aimed at providing an AWS-enabled shell prompt in the browser that is simple and secure with as little friction as possible.

AWS CloudShell is generally available in us-east-1 (N. Virginia), us-east-2 (Ohio), us-west-2 (Oregon), ap-northeast-1 (Tokyo), and eu-west-1 (Ireland) at launch.

AWS CloudShell in a nutshell

With this new service, AWS fills a gap that has been open for years, while competitors have long provided solutions, starting with GCP Cloud Shell.

You can watch the introduction of the service during Werner Vogels' keynote on YouTube:

AWS CloudShell introduction by Werner Vogels

Accessing AWS CloudShell

To access AWS CloudShell, you just have to sign in to the AWS Console and click the icon available in the top-right navigation menu.

AWS CloudShell button

Clicking the icon opens the AWS CloudShell home page and starts a new AWS CloudShell instance:

AWS CloudShell

The command-line provided has the AWS Command Line Interface (CLI) (v2) installed and configured so that you can run AWS commands without requiring any additional setup or configuration.

The environment provides pre-installed Python and Node runtimes, and tools such as jq.

AWS CloudShell is based on Amazon Linux 2.

Shells

Three shells are pre-installed: Bash, which is the default shell; Z shell, also known as zsh, which provides customization with themes and plugins; and PowerShell.

If you are a Microsoft user, the availability of PowerShell, built on top of Microsoft's .NET Common Language Runtime, will make you happy and let you take advantage of its deep integration with .NET.

The shell in use can be identified by the command prompt: $ corresponds to Bash, PS> corresponds to PowerShell and % corresponds to zsh.

The default user is cloudshell-user, which is not the default user found on Amazon Linux EC2 instances (ec2-user). Scripts designed for EC2 may cause issues if they are not adapted to run on AWS CloudShell.

Additional AWS command line interfaces (CLI)

In addition to the default AWS CLI, other CLIs come pre-installed. This is handy, as installing one of them normally takes time: you first have to find the related instructions. The provided CLIs are:

AWS Elastic Beanstalk CLI (eb)
Amazon Elastic Container Service (Amazon ECS) CLI (ecs-cli)
AWS SAM CLI (sam)

It is always time-consuming to set up a shell when you want to interact with your account resources. Moreover, as you don't do this kind of installation every other day, you have to remember how to set up your tooling.

With AWS CloudShell, you always have a working environment at hand, with no need to spend time installing tooling on a system that you don't own, whether you are on a Linux, Windows or Mac machine.

Also, you don't have much to worry about when it comes to cleaning up the machine after use, as AWS CloudShell runs in the browser.

A simple history cleanup of the browser or accessing the service via private browsing should be enough (given that the computer is not compromised).

Development tools and shell utilities

Many tools and shell utilities are also pre-installed: git, iputils, jq, tmux, vim, wget, and the CodeCommit utility for Git (git-remote-codecommit), which extends Git to provide a simple method for pushing and pulling code from CodeCommit repositories.

By default, AWS CloudShell users have sudo privileges. Therefore, it is possible to use the sudo command to install additional software. As AWS CloudShell is based on Amazon Linux 2, you will have to use yum to install software.

However, additional software has to be reinstalled in each session, as environments are recycled between sessions.

It is possible to customize the initialization of AWS CloudShell sessions by editing .bashrc. If an error locks you out of the session, it is still possible to delete the home directory (the action is available from the Action menu).

For advanced customization needs, it can be preferable to keep the configuration under version control, for example with Git.

Here is a full list of programs available in the /usr/bin directory:

/usr/bin programs

The amazon-linux-extras command is available as part of the standard installation, which means that a lot of additional software can be installed with ease.

For example, to install java-openjdk11, you just have to execute the following commands:

sudo amazon-linux-extras enable java-openjdk11
sudo yum install java-11-openjdk

Install java-openjdk11

After installation, executing java -version will return the following result:

openjdk version "11.0.7" 2020-04-14 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.7+10-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.7+10-LTS, mixed mode, sharing)

`java -version` information

Deleting home directory

Deleting data stored in the home directory is permanent and cannot be reversed, but it can be useful either when something has gone wrong or simply to remove all data.

Limits of persistent storage

AWS CloudShell lets you store 1 GB of data in each region at no cost. Only data stored in the home directory ($HOME) is persisted between two sessions. Data stored in other locations is automatically wiped at the end of a session.

Data is retained for a maximum of 120 days after the end of the last session for a given region.

AWS CloudShell encrypts this data using cryptographic keys provided by AWS KMS. The service generates and manages the cryptographic keys used for encrypting data.

Other shell limits

It is possible to run a maximum of 10 shells at the same time for each region at no charge.

After 20 to 30 minutes of inactivity the session will end.

Background processes do not count as activity. Only keyboard and mouse interactions count as activity and extend sessions. However, there is a hard limit of 12 hours, even with continuous activity. After this period of time, the session automatically ends.

When the session times out, it is possible to reconnect simply by clicking on the reconnect button.

Reconnect popup

Instance metadata

It is worth noting that, unlike on EC2 instances, instance metadata is not available from AWS CloudShell. Trying to call the magic URL results in the following error message: "curl: (7) Couldn't connect to server".

Instance metadata

Network Access & Data Transfer

Users of an AWS CloudShell session can access the public internet; however, inbound ports cannot be reached from outside. No public IP address is available.

As download & upload can be slow, the preferred way to handle large files will be to use S3 storage from the command line interface.

Download & Upload features are accessible from the Action menu:

Action Menu

Shell Layouts

It is possible to split the main window horizontally and vertically, as well as to create tabs, to organize the workspace efficiently.

Shell layout

In addition, a preferences pane gives access to additional customization parameters such as the font size or the theme:

AWS CloudShell Preferences

The Enable Safe Paste option, available in the preferences pane, is a security feature that asks you to verify that multi-line text you are about to paste does not contain malicious scripts.

Compute environment resources

Each AWS CloudShell is assigned CPU & memory resources. More specifically, 1 vCPU & 2 GiB of RAM are provided for free.

It is worth noting that the AWS CloudShell service does not provide support for Docker.

Trying to install Docker with amazon-linux-extras will fail. Executing the docker ps command returns the following error:

Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

`docker ps` command error

It should still be possible to configure the client to connect to a remote docker daemon.

Security & compliance

By default, AWS CloudShell automatically installs security patches for the system packages, so you don't have to worry about it.

Regarding compliance, AWS CloudShell is not in scope of any specific compliance programs.

If you are interested in monitoring activity on the service, you can do so through the CloudTrail integration, which reports a number of events related either to user activity in the console or to API interactions.

It is also possible to leverage EventBridge rules to react to AWS CloudShell events.

Permissions

When it comes to refining the permissions given to a specific user, IAM policies let you customize access to the level you need.

By default, the AWSCloudShellFullAccess managed policy grants permission to use AWS CloudShell with full access to all features.

However, as usual, it is also possible to restrict permissions through custom policies.

The permission prefix for the AWS CloudShell service is: cloudshell.

Three permissions specific to the service are available:

cloudshell:CreateSession, which allows starting a shell session
cloudshell:GetFileDownloadUrls, which allows downloading files from the shell environment to a local machine
cloudshell:GetFileUploadUrls, which allows uploading files from a local machine to the shell environment

It is possible, for example, to restrict access to AWS CloudShell by blocking file uploads and downloads in the shell environment with a policy such as the following:

{
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "CloudShellUser",
        "Effect": "Allow",
        "Action": [
            "cloudshell:*"
        ],
        "Resource": "*"
    }, {
        "Sid": "DenyUploadDownload",
        "Effect": "Deny",
        "Action": [
            "cloudshell:GetFileDownloadUrls",
            "cloudshell:GetFileUploadUrls"
        ],
        "Resource": "*"
    }]
}

Custom AWS CloudShell policy
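
To illustrate why the policy above blocks uploads and downloads even though it also allows cloudshell:*, here is a deliberately simplified evaluation sketch. It only models the rule that an explicit Deny overrides any Allow; it ignores resources, conditions, principals and everything else that real IAM evaluation takes into account. The function is_allowed is a hypothetical helper, not an AWS API.

```python
from fnmatch import fnmatchcase

POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CloudShellUser",
            "Effect": "Allow",
            "Action": ["cloudshell:*"],
            "Resource": "*",
        },
        {
            "Sid": "DenyUploadDownload",
            "Effect": "Deny",
            "Action": [
                "cloudshell:GetFileDownloadUrls",
                "cloudshell:GetFileUploadUrls",
            ],
            "Resource": "*",
        },
    ],
}


def is_allowed(policy: dict, action: str) -> bool:
    """Toy evaluation: explicit Deny wins, otherwise at least one Allow is needed."""
    allowed = False
    for statement in policy["Statement"]:
        if any(fnmatchcase(action, pattern) for pattern in statement["Action"]):
            if statement["Effect"] == "Deny":
                return False
            allowed = True
    return allowed


for action in ("cloudshell:CreateSession", "cloudshell:GetFileUploadUrls"):
    print(action, is_allowed(POLICY, action))
```

Starting a session matches only the Allow statement, whereas requesting upload URLs matches both statements and is therefore denied.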

The beauty of AWS CloudShell lies in the way it inherits permissions from the user connected to the AWS Console: AWS CloudShell assumes the identity of the connected user.

Pricing

Users are not charged for using AWS CloudShell, so you don't have to worry about pricing. There are no minimum fees or required upfront commitments. Only data transfer is billed, at standard rates.

AWS CloudShell plugin for VSCode

An unofficial plugin for VSCode has been built to integrate VSCode with AWS CloudShell. It lets you open multiple AWS CloudShell terminals within VSCode on demand.

AWS CloudShell plugin for VSCode

More information is available on the GitHub page of the plugin: https://github.com/iann0036/vscode-aws-cloudshell.

To get it working, the AWS CLI must be installed, as well as the Session Manager plugin for the AWS CLI.

It is also necessary to configure an AWS profile properly and to configure the VSCode plugin with it.

Conclusion

Sure, AWS CloudShell is not a technological revolution, but it fills a gap that remained open for a long time. The service still lacks some features compared to equivalent solutions available for example in GCP, but it is a first step in the right direction.

Useful links

Page of the service: https://aws.amazon.com/cloudshell
AWS Blog announcement article: https://aws.amazon.com/fr/blogs/aws/aws-cloudshell-command-line-access-to-aws-resources/

Pandas on AWS with AWS Data Wrangler

Alexis Kinsella — Tue, 09 Jun 2020 21:55:36 GMT

What is the AWS Data Wrangler library? The GitHub page of the project describes the library as Pandas on AWS.

In case you have been living in a cave for a long time, Pandas is an open source data analysis and manipulation tool, built on top of the Python programming language. Pandas is designed to be fast, powerful, flexible and easy to use.

Positioning itself as “Pandas on AWS” immediately raises the bar.

It is a project available from the awslabs GitHub organization. On the organization page you can find a bunch of projects open sourced by AWS, some more widely used or mature than others. The s2n project, an implementation of the TLS/SSL protocols, is a good example of the mature projects available.

To date, the AWS Data Wrangler module represents more than 771 commits, 20 contributors, and 52 releases. Versions are currently released at a sustained pace, and the Python module is currently available in version 1.4.0.

Installation

There are two ways to install the module: using either pip or Conda.

Pip install

To install the module with pip, you can use the following command:

pip install awswrangler

Conda install

If you are a Conda user, instead, you can install the module with the following command:

conda install -c conda-forge awswrangler

Basic usage

Following the GitHub readme introduction, here is the way to create a basic DataFrame with Pandas:

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "value": ["foo", "bar"]})

And then import the AWS Data Wrangler module:

import awswrangler as wr

Write data to Amazon S3

Now, let's create, in an S3 bucket, a file containing the serialized data from the DataFrame:

# Storing data on Data Lake
wr.s3.to_parquet(
    df=df,
    path="s3://bucket/dataset/",
    dataset=True,
    database="my_db",
    table="my_table"
)

Easy! An s3 variable at the root of the AWS Data Wrangler module gives access to functions for interacting with S3, in this case to flush the DataFrame to S3.

Read data from Amazon S3

The reverse function is also available to read the data back from S3:

# Retrieving the data directly from Amazon S3
df = wr.s3.read_parquet("s3://bucket/dataset/", dataset=True)

You may wonder what the AWS Data Wrangler package can do apart from interacting with S3. Let's take a quick tour of some of the library's features to discover its capabilities.

Definition

Here is an accurate definition of the library as displayed in the documentation:

An open-source Python package that extends the power of Pandas library to AWS connecting DataFrames and AWS data related services (Amazon Redshift, AWS Glue, Amazon Athena, Amazon EMR, etc).

Built on top of other open-source projects like Pandas, Apache Arrow, Boto3, s3fs, SQLAlchemy, Psycopg2 and PyMySQL, it offers abstracted functions to execute usual ETL tasks like load/unload data from Data Lakes, Data Warehouses and Databases.

Supported services

The aim of the library is to simplify interaction with data across the supported AWS services. Basically, the AWS Data Wrangler library supports the following six groups of services:

Amazon S3
AWS Glue Catalog
Amazon Athena
Databases (Redshift, PostgreSQL & Mysql)
EMR
CloudWatch Logs

As the library extends the power of the Pandas library to AWS by connecting DataFrames and AWS data related services, most of the available operations that deal directly with loading or flushing data rely on Pandas DataFrames.

Simplifying interactions

However, the package is not focused only on loading and unloading data. It is also meant to simplify things, and more specifically to simplify interactions with services.

The library provides, for example, functions to:

Load / unload data for Redshift
Generate a Redshift copy manifest instead of having to write it yourself

but also to:

Simplify the creation of EMR clusters, or the definition and submission of steps

Interacting with Amazon Athena

Interacting with Amazon Athena can be cumbersome. To reduce the burden, you have access to functions that make things easier, for example to start a query, stop it, or wait for its completion.

The goodness does not stop at simplified interactions with Amazon Athena. You will also find improvements for interacting with the AWS Glue Data Catalog, which make writing code straightforward.

AWS Data Wrangler as the default way to interact?

Given all these improvements over the standard APIs, it should be a no-brainer to use it as your default way to interact with the supported services in a data processing context with Python.

Let's now go deeper, with more detailed examples and notions around the AWS Data Wrangler package. To do that, let's start with sessions.

Sessions

AWS Data Wrangler interacts with AWS services using a default Boto3 session. That's why, most of the time, you won't have to provide any session information. However, if you need to customize the session the module is working with, it is possible to reconfigure the default boto3 session:

import boto3

boto3.setup_default_session(region_name="eu-west-1")

or even to instantiate a new boto3 session and pass it as a named parameter to function calls:

session = boto3.Session(region_name="us-east-2")
wr.s3.does_object_exist("s3://foo/bar", boto3_session=session)

Amazon S3

As mentioned previously, an s3 variable is available at the root of the AWS Data Wrangler module. It essentially allows you to interact with the Amazon S3 service to work on CSV, JSON, Parquet and fixed-width formatted files, and it also gives access to some handy functions purely related to file manipulation.

Let's first define two DataFrames:

import awswrangler as wr
import pandas as pd
import boto3

df1 = pd.DataFrame({
    "id": [1, 2],
    "name": ["foo", "boo"]
})

df2 = pd.DataFrame({
    "id": [3],
    "name": ["bar"]
})

With those two DataFrames created, writing them to S3 is as simple as this:

bucket = "my-bucket"

path1 = f"s3://{bucket}/csv/file1.csv"
path2 = f"s3://{bucket}/csv/file2.csv"

wr.s3.to_csv(df1, path1, index=False)
wr.s3.to_csv(df2, path2, index=False)

The previously written files can be read back in a similar fashion:

df1_bis = wr.s3.read_csv(path1)

df1_bis and df1 should contain exactly the same data.

It is also possible to read multiple CSV files at once, by explicitly listing which files have to be read:

wr.s3.read_csv([path1, path2])

Things can be made even easier by providing only the prefix to read data from:

wr.s3.read_csv(f"s3://{bucket}/csv/")

As seen in the examples above, it is very easy to interact with S3 without having to deal with complex or boilerplate code.

AWS Glue Data Catalog

Having tried the library against Amazon S3, the next step is to interact directly with the AWS Glue Data Catalog.

To do so, you just have to use the catalog variable of the module:

wr.catalog.databases()

The previous command should return the list of databases like this:

Database	Description
0	awswrangler_test	AWS Data Wrangler Test Arena - Glue Database
1	default	Default Hive database
2	sampledb	Sample database

This may not be as simple with the boto3 API directly. Listing the tables available in a specific database is just as simple:

wr.catalog.tables(database="awswrangler_test")

The command should return the following result:


Database	Table	Description	Columns	Partitions
0	awswrangler_test	lambda		col1, col2
1	awswrangler_test	noaa		id, dt, element, value, m_flag, q_flag, s_flag...

Now, to get the table details, meaning the column information, you just need to call the table() function on the catalog variable:

wr.catalog.table(database="awswrangler_test", table="boston")

The command should return the following field list:


Column Name	Type	Partition	Comment
0	crim	double	False	per capita crime rate by town
1	zn	double	False	proportion of residential land zoned for lots ...
2	indus	double	False	proportion of non-retail business acres per town
3	chas	double	False	Charles River dummy variable (= 1 if tract bou...
4	nox	double	False	nitric oxides concentration (parts per 10 mill...
5	rm	double	False	average number of rooms per dwelling
6	age	double	False	proportion of owner-occupied units built prior...
7	dis	double	False	weighted distances to five Boston employment c...
8	rad	double	False	index of accessibility to radial highways
9	tax	double	False	full-value property-tax rate per $10,000
10	ptratio	double	False	pupil-teacher ratio by town
11	b	double	False	1000(Bk - 0.63)^2 where Bk is the proportion o...
12	lstat	double	False	lower status of the population
13	target	double	False

You may wonder, however, how to create a table, say in Parquet format. To do so, call the to_parquet() function on the s3 variable with the required parameters:

Parameter	Type	Description
df	pandas.DataFrame	Pandas DataFrame https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
path	str	S3 path (for file e.g. s3://bucket/prefix/filename.parquet) (for dataset e.g. s3://bucket/prefix)
dataset	bool	If True, store a parquet dataset instead of a single file. If True, enable all the following arguments: partition_cols, mode, database, table, description, parameters, columns_comments.
database	str, optional	Glue/Athena catalog: Database name.
table	str, optional	Glue/Athena catalog: Table name
mode	str, optional	append (Default), overwrite, overwrite_partitions. Only takes effect if dataset=True
description	str, optional
parameters	Dict[str, str], optional
columns_comments	Dict[str, str], optional

All parameters can be found at the following URL: https://aws-data-wrangler.readthedocs.io/en/latest/stubs/awswrangler.s3.to_parquet.html#awswrangler.s3.to_parquet.
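
The three values of the mode parameter are easier to compare side by side. The sketch below is a toy model, not the library's implementation: it represents a dataset as a dictionary mapping a partition name to a list of files, and assumes the commonly documented semantics (append keeps existing files, overwrite replaces the whole dataset, overwrite_partitions replaces only the partitions being written). The function write_dataset and the sample partition names are hypothetical.

```python
def write_dataset(existing: dict, incoming: dict, mode: str = "append") -> dict:
    """Toy model of the write modes: dicts map a partition name to a list of files."""
    if mode == "overwrite":
        # Everything previously stored under the dataset path is dropped
        result = {}
    elif mode == "overwrite_partitions":
        # Only the partitions present in the incoming data are replaced
        result = {p: list(files) for p, files in existing.items() if p not in incoming}
    elif mode == "append":
        # Existing files are kept and new files are added next to them
        result = {p: list(files) for p, files in existing.items()}
    else:
        raise ValueError(f"unknown mode: {mode}")
    for partition, files in incoming.items():
        result.setdefault(partition, []).extend(files)
    return result


existing = {"year=2019": ["a.parquet"], "year=2020": ["b.parquet"]}
incoming = {"year=2020": ["c.parquet"]}
for mode in ("append", "overwrite", "overwrite_partitions"):
    print(mode, write_dataset(existing, incoming, mode))
```

In this model, overwrite_partitions is the only mode that both preserves the untouched 2019 partition and avoids duplicating data in the 2020 partition.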

Writing a pandas DataFrame to S3 in Parquet format, and referencing it in the Glue Data Catalog, can be done with the following code:


desc = """This is a copy of UCI ML housing dataset. https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. ‘Hedonic prices and the demand for clean air’, J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, ‘Regression diagnostics …’, Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression problems.
"""

param = {
    "source": "scikit-learn",
    "class": "cities"
}

comments = {
    "crim": "per capita crime rate by town",
    "zn": "proportion of residential land zoned for lots over 25,000 sq.ft.",
    "indus": "proportion of non-retail business acres per town",
    "chas": "Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)",
    "nox": "nitric oxides concentration (parts per 10 million)",
    "rm": "average number of rooms per dwelling",
    "age": "proportion of owner-occupied units built prior to 1940",
    "dis": "weighted distances to five Boston employment centres",
    "rad": "index of accessibility to radial highways",
    "tax": "full-value property-tax rate per $10,000",
    "ptratio": "pupil-teacher ratio by town",
    "b": "1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town",
    "lstat": "lower status of the population",
}

res = wr.s3.to_parquet(
    df=df,
    path=f"s3://{bucket}/boston",
    dataset=True,
    database="awswrangler_test",
    table="boston",
    mode="overwrite",
    description=desc,
    parameters=param,
    columns_comments=comments
)

This code example is sourced from the AWS Data Wrangler tutorials, and more specifically the following one: https://github.com/awslabs/aws-data-wrangler/blob/master/tutorials/005%20-%20Glue%20Catalog.ipynb.

Executing the previous code sample results in the following table information in the AWS Glue Data Catalog:

Amazon Athena

Now that we have learned to interact with Amazon S3 and the AWS Glue Data Catalog, and know how to flush DataFrames to S3 and reference them as a dataset in the Data Catalog, we can focus on how to query the stored data with Amazon Athena.

AWS Data Wrangler can run queries on Athena and fetch the results in two ways:

Using CTAS (ctas_approach=True), which is the default method.
Using regular queries (ctas_approach=False), and parsing CSV results on S3.

ctas_approach=True

As mentioned in the tutorials, this first approach wraps the query in a CTAS (CREATE TABLE AS SELECT) statement and reads the table data as Parquet directly from S3. It is faster, as it relies on Parquet rather than CSV, and it also enables support for nested types. It is mostly a trick compared to the original approach officially provided by the API, but it is effective and perfectly legitimate.

The trade-off of this approach is that you need additional permissions on Glue (create/delete table permissions are required). Under the hood, the mechanism creates a temporary table that is deleted immediately after consumption.
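
To give an idea of what wrapping a query in a CTAS looks like, here is a minimal sketch that builds such a statement as a string. It is an illustration of the idea only, not the code the library runs: the function wrap_in_ctas, the temporary table naming scheme and the S3 output location are all assumptions.

```python
import uuid


def wrap_in_ctas(sql: str, database: str, s3_output: str) -> tuple:
    """Return (temporary table name, CTAS statement) for an arbitrary SELECT."""
    # Hypothetical naming scheme: a random suffix avoids collisions between queries
    table = f"temp_table_{uuid.uuid4().hex}"
    location = f"{s3_output.rstrip('/')}/{table}/"
    statement = (
        f"CREATE TABLE {database}.{table}\n"
        f"WITH (format = 'PARQUET', external_location = '{location}') AS\n"
        f"{sql}"
    )
    return table, statement


table, statement = wrap_in_ctas(
    "SELECT * FROM noaa", "awswrangler_test", "s3://my-bucket/athena-results/"
)
print(statement)
```

Once such a statement has run, the Parquet files under the external location can be read directly from S3, and the temporary table can be dropped, which is why create and delete table permissions are needed.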

Query example:

wr.athena.read_sql_query("SELECT * FROM noaa", database="awswrangler_test")

ctas_approach=False

The regular approach, which parses the CSV that Athena writes to S3 as the query execution result, does not require additional permissions. Reading the results will not be as fast as with the CTAS approach, but it will still be faster than reading results with the standard AWS APIs.

Query example:

wr.athena.read_sql_query("SELECT * FROM noaa", database="awswrangler_test", ctas_approach=False)

The only difference from the previous example is that the ctas_approach parameter is set to False instead of its default value, True.

Use of categories

Declaring DataFrame columns as categories speeds up execution and also helps to save memory. All it takes is passing an additional categories parameter to the function.

wr.athena.read_sql_query("SELECT * FROM noaa", database="awswrangler_test", categories=["id", "dt", "element", "value", "m_flag", "q_flag", "s_flag", "obs_time"])

The returned columns are of type pandas.Categorical.

Batching read of results

This option is a good fit for memory-constrained environments. It is activated by passing the chunksize parameter, whose value is the size of each chunk of data to read. Reading datasets this way bounds the memory used, but it also means that the full results must be read by iterating over the chunks.

Query example:

dfs = wr.athena.read_sql_query(
    "SELECT * FROM noaa",
    database="awswrangler_test",
    ctas_approach=False,
    chunksize=10_000_000
)

for df in dfs:  # Batching
    print(len(df.index))

Since big datasets can be challenging to load and read, this is a good way to avoid memory issues.
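To make the idea of chunked reads more concrete, here is a minimal, library-independent sketch. It does not reproduce how AWS Data Wrangler implements chunksize: iter_chunks is a hypothetical helper, and the range of 25 rows is a stand-in for a large result set. It only illustrates why iterating over fixed-size chunks keeps at most one chunk in memory at a time.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_chunks(rows: Iterable[T], chunksize: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `chunksize` items from `rows`."""
    if chunksize <= 0:
        raise ValueError("chunksize must be a positive integer")
    iterator = iter(rows)
    while True:
        # Only one chunk is materialized at a time.
        chunk = list(islice(iterator, chunksize))
        if not chunk:
            return
        yield chunk


# Hypothetical stand-in for a large result set: a lazy range of 25 rows.
sizes = [len(chunk) for chunk in iter_chunks(range(25), chunksize=10)]
print(sizes)  # [10, 10, 5]
```

The last chunk is simply shorter than the others, which is why the consuming loop should never assume a fixed chunk length.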

Packaging & Dependencies

Availability as an AWS Lambda layer

Going beyond the toy demo, you may wonder how to integrate it with your code. Can it be integrated easily, for example with AWS Lambda functions? Will you have to build a complex pipeline to integrate it properly into your AWS Lambda package?

The answer is definitely no! A Lambda Layer zip file is available alongside the Python wheels and eggs. The Lambda Layers are currently available in 3 flavors: Python 3.6, 3.7 and 3.8.

AWS Glue integration

Because the AWS Data Wrangler package relies on compiled dependencies (C/C++), there is no support for Glue PySpark for now. Only integration with Glue Python Shell is possible at the moment.

Going one step deeper

If you want to learn more about the library, feel free to read the documentation, as it is a good source of inspiration. You can also visit the GitHub repository of the project and browse the tutorials directory.

Must-know options of the AWS CLI

Alexis Kinsella — Thu, 21 May 2020 21:41:49 GMT

So you have installed the AWS CLI on your system. What can you do with it? Let's explore some basic usages.

Know how to get help

Sooner or later, you will need some help. You can search the internet, but you can also just use what is at your fingertips.

By typing the `aws` command in your favorite shell, you will get the usual usage information for the command:

$ aws
usage: aws [options] <command> <subcommand> [<subcommand> ...] [parameters]
To see help text, you can run:

  aws help
  aws <command> help
  aws <command> <subcommand> help
aws: error: the following arguments are required: command

Reading the usage carefully, you may notice that you can access help at the CLI level, at the command level, and then at the command/subcommand level.

Without even paying attention, we just learned something interesting: the AWS CLI relies not only on commands, but also on subcommands. Basically, commands reference services, and subcommands reference the actions available for the selected service.

Here is the command structure:

$ aws <command> <subcommand> [options and parameters]

Depending on the command/subcommand used, you will be able to use various types of input values, such as numbers, strings, lists, maps or even JSON structures.
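When the CLI is driven from a script, the command structure maps naturally to an argument list. The sketch below is only an illustration: build_aws_command is a hypothetical helper, not part of the AWS CLI or of any SDK, and it merely assembles the command, subcommand and options in the order described above.

```python
import shlex
from typing import Dict, List, Optional


def build_aws_command(
    command: str,
    subcommand: str,
    options: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Assemble `aws <command> <subcommand> [options and parameters]` as an argv list."""
    argv = ["aws", command, subcommand]
    for name, value in (options or {}).items():
        argv.extend([f"--{name}", value])
    return argv


argv = build_aws_command("ec2", "describe-volumes", {"output": "table"})
print(shlex.join(argv))  # aws ec2 describe-volumes --output table
```

Such a list can be handed to subprocess.run as-is, which avoids shell quoting issues when a value contains spaces or quotes.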

By executing the command `aws help`, you will get the following output:

AWS()



NAME
       aws -

DESCRIPTION
       The  AWS  Command  Line  Interface is a unified tool to manage your AWS
       services.

SYNOPSIS
          aws [options] <command> <subcommand> [parameters]

       Use aws command help for information on a  specific  command.  Use  aws
       help  topics  to view a list of available help topics. The synopsis for
       each command shows its parameters and their usage. Optional  parameters
       are shown in square brackets.

OPTIONS
       --debug (boolean)

       Turn on debug logging.

       --endpoint-url (string)

       Override command's default URL with the given URL.

       --no-verify-ssl (boolean)

       By  default, the AWS CLI uses SSL when communicating with AWS services.
       For each SSL connection, the AWS CLI will verify SSL certificates. This
       option overrides the default behavior of verifying SSL certificates.

       --no-paginate (boolean)

       Disable automatic pagination.

       --output (string)

       The formatting style for command output.

       o json

       o text

       o table

       --query (string)

       A JMESPath query to use in filtering the response data.

       --profile (string)

       Use a specific profile from your credential file.

       --region (string)

       The region to use. Overrides config/env settings.

       --version (string)

       Display the version of this tool.

       --color (string)

       Turn on/off color output.

       o on

       o off

       o auto

       --no-sign-request (boolean)

       Do  not  sign requests. Credentials will not be loaded if this argument
       is provided.

       --ca-bundle (string)

       The CA certificate bundle to use when verifying SSL certificates. Over-
       rides config/env settings.

       --cli-read-timeout (int)

       The  maximum socket read time in seconds. If the value is set to 0, the
       socket read will be blocking and not timeout.

       --cli-connect-timeout (int)

       The maximum socket connect time in seconds. If the value is set  to  0,
       the socket connect will be blocking and not timeout.

AVAILABLE SERVICES
      o ...

Looking at the bottom, you can see the full list of services supported by this version of the CLI. More importantly, you have all the options of the CLI, and you can already spot some goodies such as debug logging, switching the target endpoint, filtering the response content, and even configuring the targeted region or the profile used.

Let's go through some of the interesting options and see what they have to offer!

Debug logging

aws --debug ...

Being able to troubleshoot commands may become critical when you experience issues with the AWS CLI. The simple debug flag activates highly verbose debug logs, giving you the precious information you need to understand what is going on.

Endpoint URL

aws --endpoint-url <url> ...

When AWS service endpoints are hosted directly within a private VPC, you have to specify them explicitly so that the CLI uses them instead of the default public one.

This may be especially useful in an enterprise context, for example when the company network is connected to the VPC: if you want to avoid going through the internet, you will have to configure and use the VPC endpoint associated with the service you are targeting.

Output format

The output flag is very handy. It lets the CLI provide answers in multiple formats: json, yaml, text, and table.

On the one hand, the text format is useful to process responses with standard Unix tools such as `grep`, `sed` or `awk`. On the other hand, the table format presents the data as a human-readable table.

The output value can be pre-configured in the AWS CLI config file. Here is an example:

[default]
output=text

It is also possible to specify it with an environment variable:

$ export AWS_DEFAULT_OUTPUT="table"

And of course, you can still override the default configuration with the flag:

$ aws swf list-domains --registration-status REGISTERED --output json

Text format

The text format gives a more compact presentation that may be a better fit when you execute requests and want results that are much more readable:

$ aws iam list-users --output text --query 'Users[*].[UserName,Arn,CreateDate,PasswordLastUsed,UserId]'

Admin         arn:aws:iam::123456789012:user/Admin         2014-10-16T16:03:09+00:00   2016-06-03T18:37:29+00:00   AIDA1111111111EXAMPLE
backup-user   arn:aws:iam::123456789012:user/backup-user   2019-09-17T19:30:40+00:00   None                        AIDA2222222222EXAMPLE
cli-user      arn:aws:iam::123456789012:user/cli-backup
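The text format is also easy to consume from a script, because the CLI separates columns with tabs. The sketch below assumes a hypothetical, shortened sample limited to three columns. parse_text_output is an illustrative helper, not something provided by the AWS CLI.

```python
import csv
import io
from typing import List

# Hypothetical sample of `--output text`, with tab-separated columns.
sample = (
    "Admin\tarn:aws:iam::123456789012:user/Admin\t2014-10-16T16:03:09+00:00\n"
    "backup-user\tarn:aws:iam::123456789012:user/backup-user\t2019-09-17T19:30:40+00:00\n"
)


def parse_text_output(text: str) -> List[List[str]]:
    """Split tab-delimited CLI output into rows of column values."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


rows = parse_text_output(sample)
user_names = [row[0] for row in rows]
print(user_names)  # ['Admin', 'backup-user']
```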

Table format

If you want something more tabular and more visual, for example to add the results of a request to some documentation, you may use the output flag this way:

aws ec2 describe-volumes --query 'Volumes[*].{ID:VolumeId,InstanceId:Attachments[0].InstanceId,AZ:AvailabilityZone,Size:Size}' --output table

and then get the following result:

------------------------------------------------------
|                   DescribeVolumes                  | 
+------------+----------------+--------------+-------+
|     AZ     |      ID        | InstanceId   | Size  |
+------------+----------------+--------------+-------+
|  us-west-2a|  vol-e11a5288  |  i-a071c394  |  30   |
|  us-west-2a|  vol-2e410a47  |  i-4b41a37c  |  8    |
+------------+----------------+--------------+-------+

Query specific data

aws --query <query> ...

The query flag lets you specify a JMESPath query used to filter the response data. JMESPath is a query language for JSON.

You can find detailed information in the AWS Command Line Interface documentation, on the page "Controlling command output from the AWS CLI".

Let's say you want to describe the volumes available in the EC2 service. You will have to execute the following command:

$ aws ec2 describe-volumes

And you will get this kind of answer, provided you configured the output as json:

{
    "Volumes": [
        {
            "AvailabilityZone": "us-west-2a",
            "Attachments": [
                {
                    "AttachTime": "2013-09-17T00:55:03.000Z",
                    "InstanceId": "i-a071c394",
                    "VolumeId": "vol-e11a5288",
                    "State": "attached",
                    "DeleteOnTermination": true,
                    "Device": "/dev/sda1"
                }
            ],
            "VolumeType": "standard",
            "VolumeId": "vol-e11a5288",
            "State": "in-use",
            "SnapshotId": "snap-f23ec1c8",
            "CreateTime": "2013-09-17T00:55:03.000Z",
            "Size": 30
        },
        {
            "AvailabilityZone": "us-west-2a",
            "Attachments": [
                {
                    "AttachTime": "2013-09-18T20:26:16.000Z",
                    "InstanceId": "i-4b41a37c",
                    "VolumeId": "vol-2e410a47",
                    "State": "attached",
                    "DeleteOnTermination": true,
                    "Device": "/dev/sda1"
                }
            ],
            "VolumeType": "standard",
            "VolumeId": "vol-2e410a47",
            "State": "in-use",
            "SnapshotId": "snap-708e8348",
            "CreateTime": "2013-09-18T20:26:15.000Z",
            "Size": 8
        }
    ]
}

This may be verbose and not very handy. This is where the query flag becomes interesting, since it lets you reduce the result payload to only what you are interested in. For example, if you want the VolumeId, the AvailabilityZone and the Size, you will have to execute the following command:

aws ec2 describe-volumes --query 'Volumes[*].{VolumeId:VolumeId,AvailabilityZone:AvailabilityZone,Size:Size}'

Result will be the following one:

[
    {
        "AvailabilityZone": "us-west-2a",
        "VolumeId": "vol-e11a5288",
        "Size": 30
    },
    {
        "AvailabilityZone": "us-west-2a",
        "VolumeId": "vol-2e410a47",
        "Size": 8
    }
]

You can go even further by providing aliases.

aws ec2 describe-volumes --query 'Volumes[*].{ID:VolumeId,AZ:AvailabilityZone,Size:Size}'

providing the following result:

[
    {
        "AZ": "us-west-2a",
        "ID": "vol-e11a5288",
        "Size": 30
    },
    {
        "AZ": "us-west-2a",
        "ID": "vol-2e410a47",
        "Size": 8
    }
]
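To see what such a projection computes, here is the aliasing query above re-expressed in plain Python on a trimmed copy of the sample response. This is only an equivalent of the result, not the way the CLI evaluates JMESPath. The use of dict.get mirrors the fact that JMESPath yields null for a missing key.

```python
import json

# Trimmed copy of the describe-volumes response shown earlier.
response = json.loads("""
{
  "Volumes": [
    {"AvailabilityZone": "us-west-2a", "VolumeId": "vol-e11a5288", "Size": 30},
    {"AvailabilityZone": "us-west-2a", "VolumeId": "vol-2e410a47", "Size": 8}
  ]
}
""")

# Equivalent of: Volumes[*].{ID:VolumeId,AZ:AvailabilityZone,Size:Size}
projected = [
    {"ID": v.get("VolumeId"), "AZ": v.get("AvailabilityZone"), "Size": v.get("Size")}
    for v in response["Volumes"]
]
print(json.dumps(projected, indent=4))
```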

Filter result content

The possibilities are almost limitless once you know how to handle the JMESPath query language. It is even possible to filter responses with expressions:

$ aws ec2 describe-volumes \
    --filters "Name=availability-zone,Values=us-west-2a" "Name=status,Values=attached" \
    --query 'Volumes[?Size > `50`].{Id:VolumeId,Size:Size,Type:VolumeType}'

Here we want to get only the volumes with a size greater than 50 GiB. The powerful part is that you don't have to write code to handle this kind of filtering: you just have to leverage the power of the query flag.
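For comparison, this is roughly the code you would otherwise have to write yourself. The sketch applies the same size filter and projection in plain Python. The volume list is sample data, and the third entry is hypothetical, added only so that one volume passes the filter.

```python
volumes = [
    {"VolumeId": "vol-e11a5288", "Size": 30, "VolumeType": "standard"},
    {"VolumeId": "vol-2e410a47", "Size": 8, "VolumeType": "standard"},
    # Hypothetical volume, added so that the filter keeps something.
    {"VolumeId": "vol-0000example", "Size": 100, "VolumeType": "gp2"},
]

# Equivalent of: Volumes[?Size > `50`].{Id:VolumeId,Size:Size,Type:VolumeType}
large_volumes = [
    {"Id": v["VolumeId"], "Size": v["Size"], "Type": v["VolumeType"]}
    for v in volumes
    if v["Size"] > 50
]
print(large_volumes)
```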

Choose the profile

There are multiple ways to configure a profile. It is also possible to pass it as a flag of the executed command, which might be handy in some situations. You basically have to add the profile this way:

aws configure --profile <profile-name>

Region configuration

As with the `profile` option, there are multiple ways to provide the `region` value. The region determines the target endpoint used by the CLI to talk to the expected region.
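The AWS CLI resolves such settings in a fixed order: the command line flag first, then the environment variable, then the config file. The sketch below models that precedence for the region. resolve_region is a hypothetical helper written for illustration, and the config file lookup is reduced to a plain argument.

```python
from typing import Mapping, Optional


def resolve_region(
    cli_region: Optional[str],
    env: Mapping[str, str],
    config_region: Optional[str],
) -> Optional[str]:
    """Return the first region found: --region flag, then env var, then config file."""
    return cli_region or env.get("AWS_DEFAULT_REGION") or config_region


# No flag given: the environment variable wins over the config file.
print(resolve_region(None, {"AWS_DEFAULT_REGION": "eu-west-1"}, "us-east-1"))  # eu-west-1
```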

Conclusion

There are many options available as flags of the command line. Most of the time, they have alternatives, for example as environment variables. Knowing them will make you more proficient at the tasks you deal with on a daily basis. Not using these powerful options may make your work harder, as you would have to implement the needed feature yourself.

How to install AWS CLI v1 on Mac

Alexis Kinsella — Tue, 19 May 2020 22:45:37 GMT

First of all, you need to know that there are 2 versions of the AWS CLI. In this article we will focus on installing AWS CLI v1, as it is the most common and best-known version.

Prerequisites

AWS CLI v1 relies on Python, and is compatible either with Python 2 or Python 3.

You can check your Python version with the following command line:

$ python --version

If your computer doesn't already have Python, you will first have to install it.

Install from Zip

This is not the most straightforward way to install the AWS CLI, but you can install it from the Zip bundle that is downloadable from S3.

You can install AWS CLI v1 with the following command:

curl "https://s3.amazonaws.com/aws-cli/awscli-bundle.zip" -o "awscli-bundle.zip"
unzip awscli-bundle.zip
sudo ./awscli-bundle/install -i /usr/local/aws -b /usr/local/bin/aws

Verify installation

If everything is OK, you should be able to execute the following command and see the version number of the CLI as a result:

$ aws --version
aws-cli/1.17.4 Python/3.7.4 Darwin/18.7.0 botocore/1.13

Install with pip

If you prefer, you can also install the CLI with pip. To do so, execute this command:

pip3 install awscli --upgrade --user

You should then also get a result by typing the command `aws --version`.

More information

If you want more information, you can refer to the AWS CLI install page of the official documentation by following this link:

https://docs.aws.amazon.com/cli/latest/userguide/install-macos.html

Rust language is 1 year old

Alexis Kinsella — Tue, 17 May 2016 08:40:10 GMT

The Rust language aims to offer:

Uncompromising performance and control,
Prevention of many categories of bugs such as concurrency issues,
Ergonomics at the height of languages like Python and Ruby.

A year separates version 1.8.0 from the release of version 1.0.0. More specifically, this represents nearly 12,000 commits and no fewer than 700 contributors. Remarkably, the language has become the most loved language among developers on StackOverflow.

The Rust anniversary article also offers concrete cases of adopting the language:

The Dropbox use case is particularly interesting because it highlights how the company used Rust to develop the software that controls the hardware it designed in an effort to become self-sufficient with respect to Amazon Web Services. There is no need to stress how critical this task is for a company that decides to operate its own equipment on such a scale. While Dropbox's back-end infrastructure has historically been written in Go, key issues such as memory footprint and lack of control over server usage prompted some components to be rewritten in Rust. According to Jamie Turner, the advantages of Rust are numerous: advanced abstraction capabilities, no nulls, no segfaults, no leaks, yet performance close to C and adequate memory control.
In a second piece of feedback, the article tells us about Servo and the related developments that are slowly starting to land in the Firefox code base, among them the MP4 metadata parsing task on OS X and Linux since Firefox version 45. Although the code still runs in test mode, no fewer than 1 billion execution reports have been compared with the C++ version, with 100% accuracy. This example, however, is only the tip of the iceberg, since other pieces of code should be integrated in the long run.

During this first year, the focus was particularly on improving Rust: the ecosystem, the supported platforms, the tools, the compiler, and even the language itself. The article details each of these categories.

The first Rust language conference, RustConf, is scheduled for September 9-10, 2016 in Portland. If the Rust language is of interest to you, and you live in Europe, don't worry, RustFest is also scheduled for Berlin on September 17, 2016.

Finally, if you want to follow Rust news, you can subscribe to the This Week in Rust newsletter to keep up to date with what's new in the ecosystem.

Zero downtime deployment with Node.js and Express, a first step ...

Alexis Kinsella — Mon, 21 Jul 2014 09:00:20 GMT

When you want to stop or restart a server, several solutions are available. One of them is to send a SIGTERM signal to the process.

This solution is commonly used. Unfortunately, it cuts the current connections without allowing the server to honor the requests being processed.

In order to provide a better quality of service, it is important to honor every incoming request. So how can a server be restarted smoothly, without abruptly cutting the current connections?

The HTTP protocol allows the server to answer incoming requests with a 503 status code, which means that the service is not available (Service Unavailable). The idea is therefore to return this status code as soon as the process has received a SIGTERM signal, while giving the server time to finish processing the in-flight HTTP requests, and then to stop the server once those requests have been handled. It is always possible to kill the server process if this takes too long.

Node.js

With Node.js, it is possible to listen to the signals received by the process and react to them. It is therefore possible to set up a mechanism that instructs the server to answer new incoming requests with a 503 status code, and then to shut the server down once the in-flight requests have been handled.

This gives the following code:

express = require 'express'

start = new Date()

# Express
app = express()

# Set to true once SIGTERM has been received
gracefullyClosing = false

app.configure ->
    app.set 'port', process.env.PORT or 8000

# Refuse new requests with a 503 while the server is shutting down
app.use (req, res, next) ->
    return next() unless gracefullyClosing
    res.setHeader "Connection", "close"
    res.send 503, "Server is in the process of restarting"

app.use app.router

app.get '/', (req, res) ->
    res.send 200, 'OK'

httpServer = app.listen app.get('port')

process.on 'SIGTERM', ->
    console.info "Received kill signal (SIGTERM), shutting down gracefully."
    gracefullyClosing = true

    # Stop accepting connections and let in-flight requests finish
    httpServer.close ->
        console.info "Closed out remaining connections."
        process.exit()

    # Force the shutdown if connections are not closed within 30 seconds
    setTimeout ->
        console.error "Could not close connections in time, forcefully shutting down"
        process.exit 1
    , 30 * 1000

Calling the close function on the HTTP server instance returned by Express allows the server to finish processing the in-flight requests before stopping.
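The pattern itself is not specific to Node.js. As an illustration only, here is the same idea sketched as a Python WSGI middleware: once a shutdown has been requested, new requests are answered with a 503. All the names below (shutting_down, graceful_shutdown_middleware, hello_app, on_sigterm) are hypothetical, and the sketch leaves out the part that waits for in-flight requests and stops the server.

```python
import threading

# Set once a shutdown has been requested (for example from a SIGTERM handler).
shutting_down = threading.Event()


def graceful_shutdown_middleware(app):
    """Wrap a WSGI app so that it answers 503 once shutdown has been requested."""
    def wrapper(environ, start_response):
        if shutting_down.is_set():
            body = b"Server is in the process of restarting"
            # WSGI applications must not set hop-by-hop headers such as
            # "Connection", so only the status and entity headers are sent.
            start_response(
                "503 Service Unavailable",
                [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))],
            )
            return [body]
        return app(environ, start_response)
    return wrapper


def hello_app(environ, start_response):
    """Hypothetical application answering 200 OK to every request."""
    body = b"OK"
    start_response(
        "200 OK",
        [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))],
    )
    return [body]


def on_sigterm(signum, frame):
    """Signal handler: from now on, new requests are refused with a 503."""
    shutting_down.set()


application = graceful_shutdown_middleware(hello_app)
# In a real process, the handler would be registered from the main thread with:
#   signal.signal(signal.SIGTERM, on_sigterm)
```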

If your Node.js application is properly made redundant, with a reverse proxy (Nginx, HAProxy) in front of the different instances, incoming requests will be redirected to other instances that are able to process them. This will be the case as soon as your application answers incoming requests with 502 or 503 status codes, for example.

Nginx

If your application is deployed behind a reverse proxy such as Nginx, you only need to configure it with several upstream servers pointing to different instances of your application for it to be able to hand over to another instance when it receives a 502 or 503 error code in response to a request forwarded to one of the instances.

upstream my_app_upstream {
	server 127.0.0.1:7000;
	server 127.0.0.1:8000;
	server 127.0.0.1:9000;
}

Next, you have to declare what Nginx must do when it receives a 502 or 503 response, and that's it. Here, we tell Nginx through the proxy_next_upstream directive to pass the request on to the next upstream server when it receives an error, a timeout, or an HTTP 502 or 503 code in response:

location /app {
	...
	proxy_next_upstream error timeout http_502 http_503;
	...
	proxy_pass http://my_app_upstream;
}
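Conceptually, the directive above amounts to a small failover loop. The Python sketch below models that loop under simplifying assumptions: each upstream is a hypothetical callable returning an HTTP status, and connection errors and timeouts are represented by exceptions. It is not how Nginx is implemented, and it ignores load balancing and failure counters.

```python
from typing import Callable, Iterable, Optional

# Statuses listed as http_502 and http_503 in the proxy_next_upstream directive.
RETRYABLE_STATUSES = {502, 503}


def forward(upstreams: Iterable[Callable[[], int]]) -> Optional[int]:
    """Try each upstream in turn and return the first non-retryable status."""
    last_status = None
    for call_upstream in upstreams:
        try:
            status = call_upstream()
        except (ConnectionError, TimeoutError):
            # "error" and "timeout" cases: move on to the next upstream.
            continue
        if status not in RETRYABLE_STATUSES:
            return status
        last_status = status
    # Every upstream failed: report the last retryable status seen, if any.
    return last_status


def restarting() -> int:
    """Hypothetical instance that is shutting down gracefully."""
    return 503


def unreachable() -> int:
    """Hypothetical instance that is already stopped."""
    raise ConnectionError("connection refused")


def healthy() -> int:
    """Hypothetical instance able to serve the request."""
    return 200


print(forward([restarting, unreachable, healthy]))  # 200
```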

Limitations

The mechanism described in this article fits the needs of applications handling simple HTTP requests. However, it does not address configurations that have enabled the keepalive option for HTTP connections, nor applications using websockets. A suitable solution will have to be found for those cases.

Conclusion

Implementing graceful closing when restarting to deploy a new version is an important first step towards Zero Downtime Deployment. It not only allows in-flight requests to be honored, but also lets your reverse proxies notice that the service is unavailable and redispatch requests to other servers before the upstream is cut.

Clustering your Node.js application

Alexis Kinsella — Thu, 17 Jul 2014 09:00:24 GMT

Node.js applications are single-threaded by nature, whereas servers nowadays are almost* always multi-core. To exploit the full capacity of these servers, you need to be able to use all the cores.

There are mainly 2 techniques to do so:

Run several instances of a Node.js application on different ports, with a reverse proxy to load-balance incoming requests
Run a Node.js application in cluster mode

Ideally, you should run as many instances as there are cores on the machine. This shares the power of the machine between the instances as well as possible, without degrading performance by sharing a core between several instances.

In this article, we will focus on the second case, that is, running a Node.js application in cluster mode.

* The word "almost" is used here because many entry-level cloud servers remain single-core (VPS and low-cost EC2 instances, ...).

The cluster module

The cluster module, although marked as having an experimental API in the Node.js documentation, is widely used today.

Its principle is simple: when an application is launched in cluster mode, a first process is started in master mode. The role of the master process is not to handle incoming requests itself, but rather to dispatch them to forked processes that are dedicated to handling requests. These are launched in worker mode.

Instantiating forks in worker mode is the responsibility of the master process, and the forking rules are left to the developer. Throughout the life of the cluster, events are emitted by both the master and the workers. It is important to subscribe to them in order to react to changes in the cluster (a worker crash, for example).

A simple cluster example is given in the documentation:

cluster = require("cluster")
http = require("http")
numCPUs = require("os").cpus().length

if cluster.isMaster
  # Fork workers.
  i = 0
  while i < numCPUs
    cluster.fork()
    i++
  cluster.on "exit", (worker, code, signal) ->
    console.log "worker " + worker.process.pid + " died"
else
  # Workers can share any TCP connection
  # In this case its a HTTP server
  http.createServer((req, res) ->
    res.writeHead 200
    res.end "hello world\n"
  ).listen 8000

Although the clustering API is fairly simple to handle, several points of detail still need to be taken care of, and it is easy to get things wrong. It is therefore advisable to rely on a third-party module for this.

The recluster module, very interesting for its simplicity and maturity, removes all this implementation complexity.

recluster

To install it, just type the following command:

npm install recluster --save

To start an application in cluster mode, just add the following script (cluster.coffee, for example) to your application:

recluster = require("recluster")
path = require("path")

cluster = recluster(path.join(__dirname, "server.js"), { ### Options ### })
cluster.run()

process.on "SIGUSR2", ->
  console.log "Got SIGUSR2, reloading cluster..."
  cluster.reload()

console.log "spawned cluster, kill -s SIGUSR2", process.pid, "to reload"

The recluster module expects the path of the script as a parameter. Then, just launch the cluster.js file instead of the server.js file, and that's it! (Since the examples are written in CoffeeScript, the CoffeeScript files must be compiled beforehand.)

Zero downtime reloading

The recluster module makes it possible to update an application without service interruption. To do so, call the reload function on the cluster variable.

In the example above, the SIGUSR2 signal tells the program that the cluster must be reloaded. The cluster will reload the workers according to the options passed as parameters (reload timeout, etc.). In-flight requests will be honored without being cut abruptly, within the limit of a timeout defined by the module's configuration options.

This feature is particularly interesting when the application must be updated without service interruption. The code base can be updated, and the workers then restarted one by one once their in-flight requests have been handled.

It is possible to rely on other conditions to reload the workers of a cluster. For example, a change to the package.json file can trigger a reload with the following code:

fs = require 'fs'
fs.watchFile "package.json", (curr, prev) ->
    console.log "Package.json changed, reloading cluster..."
    cluster.reload()

Conclusion

Don't be intimidated by the notion of clustering. It is very simple to implement in the Node.js world thanks to a base API that is part of the core modules, and to the many modules built on top of it.

Handling errors with Node.js

Alexis Kinsella — Tue, 15 Jul 2014 09:00:59 GMT

When an exception is not handled in a Node.js program, it generally ends with a crash of the application process. There is not much that can be done to recover if the error bubbles up to the event loop. This is why errors must be handled carefully.

If your program raises an error that bubbles up to the event loop, as follows:

process.nextTick () ->
	throw new Error("Some Bad Error")

You will get the following error message:

Express listening on port: 9000
Started in 0.073 seconds

/Users/akinsella/Workspace/Projects/gtfs-playground/build/app-test.js:30
    throw new Error("Some Bad Error");
          ^
Error: Some Bad Error
    at /Users/akinsella/Workspace/Projects/gtfs-playground/build/app-test.js:30:11
    at process._tickCallback (node.js:415:13)
    at Function.Module.runMain (module.js:499:11)
    at startup (node.js:119:16)
    at node.js:901:3

Process finished with exit code 8

The 'uncaughtException' event

Node.js gives you a chance to intercept errors that bubble up to the event loop by dispatching an uncaughtException event.

Contrary to what one might think, the event is not dispatched by the Node.js process so that you can catch the error and let the program continue its execution. Its main purpose is to properly release the resources that were opened by the program, and possibly to log the context of the error more precisely (memory state, etc.).

When an error bubbles up to the event loop, the state of the program must no longer be considered consistent. This is why you must not try to catch the exception with the idea of letting the program keep running.

If you want to log an error message when an exception bubbles up to the event loop, you can add the following code to your program:

process.on 'uncaughtException', (err) ->
    console.log JSON.stringify(process.memoryUsage())
    console.error "An uncaughtException was found, the program will end. #{err}, stacktrace: #{err.stack}"
    process.exit 1

process.nextTick () ->
    throw new Error("Some Bad Error")

Which gives the following result:

/Users/akinsella/.nvm/v0.10.22/bin/node app-test.js
{"rss":12312576,"heapTotal":4083456,"heapUsed":2153648}
An uncaughtException was found, the program will end. Error: Some Bad Error, stacktrace: Error: Some Bad Error
    at /Users/akinsella/Workspace/Projects/gtfs-playground/build/app-test.js:13:11
    at process._tickCallback (node.js:415:13)
    at Function.Module.runMain (module.js:499:11)
    at startup (node.js:119:16)
    at node.js:901:3

Process finished with exit code 1

Unlike the default handling, we were able to return a specific exit code. Here, the exit code is 1.
The log message is also different. We are therefore able to control the error log in case of a crash.
In addition, the memory information made available in the logs will help with the analysis of the crash.
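For readers more familiar with Python, the closest counterpart is sys.excepthook, a last-chance hook called for exceptions that nobody caught. The sketch below is an analogy under that assumption, not a translation of the Node.js API: format_crash_report, crash_hook and install_crash_hook are hypothetical names. After the hook returns, the interpreter still exits with a non-zero status, so the hook is only used to control what gets logged.

```python
import sys
import traceback


def format_crash_report(exc: BaseException) -> str:
    """Build a log message with the error and its stack trace."""
    stack = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return (
        "An uncaught exception was found, the program will end. "
        f"{exc!r}, stacktrace: {stack}"
    )


def crash_hook(exc_type, exc, tb):
    """Last-chance handler: log the context, do not try to resume the program."""
    print(format_crash_report(exc), file=sys.stderr)


def install_crash_hook() -> None:
    """Register the hook for exceptions that reach the top level."""
    sys.excepthook = crash_hook


# Build a report from a sample exception, without installing the hook.
try:
    raise RuntimeError("Some Bad Error")
except RuntimeError as error:
    report = format_crash_report(error)

print(report)
```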

Express

If you use a framework such as Express, you will be relieved of part of the work, because errors that occur while processing an HTTP request are caught by the framework, which handles the error for you.

By default, Express simply reports a crash that occurs while processing an HTTP request through a plain log returned in the HTTP response.

For example, by running the following program:

express = require 'express'

app = express()

app.configure ->
    app.set 'port', process.env.PORT or 9000
    app.use app.router

app.get "/", (req, res) ->
    throw new Error("Some Bad Error")

httpServer = app.listen app.get('port')

process.on 'uncaughtException', (err) ->
    console.error "An uncaughtException was found, the program will end. #{err}, stacktrace: #{err.stack}"
    process.exit 1

console.error "Express listening on port: #{app.get('port')}"

Then browse to http://localhost:9000. Express returns the following log in the HTTP response:

Error: Some Bad Error
    at /Users/akinsella/Workspace/Projects/gtfs-playground/build/app-test.js:16:11
    at callbacks (/Users/akinsella/Workspace/Projects/gtfs-playground/node_modules/express/lib/router/index.js:164:37)
    at param (/Users/akinsella/Workspace/Projects/gtfs-playground/node_modules/express/lib/router/index.js:138:11)
    at pass (/Users/akinsella/Workspace/Projects/gtfs-playground/node_modules/express/lib/router/index.js:145:5)
    at Router._dispatch (/Users/akinsella/Workspace/Projects/gtfs-playground/node_modules/express/lib/router/index.js:173:5)
    at Object.router (/Users/akinsella/Workspace/Projects/gtfs-playground/node_modules/express/lib/router/index.js:33:10)
    at next (/Users/akinsella/Workspace/Projects/gtfs-playground/node_modules/express/node_modules/connect/lib/proto.js:193:15)
    at Object.expressInit [as handle] (/Users/akinsella/Workspace/Projects/gtfs-playground/node_modules/express/lib/middleware.js:30:5)
    at next (/Users/akinsella/Workspace/Projects/gtfs-playground/node_modules/express/node_modules/connect/lib/proto.js:193:15)
    at Object.query [as handle] (/Users/akinsella/Workspace/Projects/gtfs-playground/node_modules/express/node_modules/connect/lib/middleware/query.js:45:5)

A more detailed log with HTML formatting can also be enabled, which is particularly useful in development mode, by adding the following lines:

app.configure 'development', () ->
    app.use express.errorHandler
        dumpExceptions: true,
        showStack: true

The result is an HTML-formatted error page.

Express also lets you register a middleware that can act on the errors encountered while processing HTTP requests. This middleware is useful for logging the error, or for releasing resources associated with the request being processed.

It also lets you return a suitable response to the user when an unhandled error occurs. This is particularly valuable when implementing a REST API: the server can return an error the client is able to interpret, even when the error was not handled.

The middleware takes the following form:

app.use (err, req, res, next) ->
    console.error "Error: #{err}, Stacktrace: #{err.stack}"
    res.send 500, "Something broke! Error: #{err}, Stacktrace: #{err.stack}"
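
For a REST API, the same idea can be sketched in JavaScript with a handler that returns a JSON error body instead of plain text. This is a minimal sketch under assumptions, not Express's own code: the name apiErrorHandler and the error/message payload shape are made up for illustration, and it uses the res.status(...).json(...) chaining found in Express 4 rather than the res.send 500, ... form shown above.

```javascript
// Error-handling middleware: Express recognizes it by its four parameters.
// The JSON payload shape below is an assumption made for illustration.
function apiErrorHandler(err, req, res, next) {
    console.error(`Error: ${err}, Stacktrace: ${err.stack}`);
    // Reply with something a client can parse; the stack stays in the logs.
    res.status(err.status || 500).json({
        error: err.name,
        message: err.message
    });
}

// Registration, after all routes and other middleware:
//   app.use(apiErrorHandler);
```

Keeping the stack trace out of the response body is a design choice here: the client gets a stable, parseable structure while the details remain available in the server logs.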

Promises

Promises can help you handle errors more effectively thanks to their error-handling mechanism.

Work wrapped in a promise never lets an error propagate up to the event loop: the error is caught by the promise and surfaces in the fail or catch function, depending on the library, or in the error callback of the then function.

It is therefore worthwhile to wrap your processing in promises, not only to improve code readability but also to make it more resilient to crashes.
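
A minimal sketch of this behavior, written with the native JavaScript Promise rather than a specific promise library; parseSettings is a hypothetical helper name used only for illustration.

```javascript
// An exception thrown inside a then() callback does not reach the event loop:
// it rejects the promise returned by then().
function parseSettings(text) {
    return Promise.resolve(text).then(JSON.parse);
}

parseSettings('{ "port": 9000 }').then((settings) => {
    console.log(`port: ${settings.port}`);
});

parseSettings('{ not json').catch((err) => {
    // The SyntaxError thrown by JSON.parse lands here instead of crashing the process.
    console.error(`invalid settings: ${err.name}`);
});
```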

Domains

Domains are not covered in this article, but be aware that this notion was added to Node.js in version 0.8.

In short, the idea is more or less to group event emitters by associating them with a domain. When an error occurs while handling an event managed by a domain, the exception does not crash the program directly; the domain is responsible for handling the error. In general, however, this does not save you from restarting the process, as the documentation points out:

Domain error handlers are not a substitute for closing down your process when an error occurs.

By the very nature of how throw works in JavaScript, there is almost never any way to safely “pick up where you left off”, without leaking references, or creating some other sort of undefined brittle state.

The safest way to respond to a thrown error is to shut down the process. Of course, in a normal web server, you might have many connections open, and it is not reasonable to abruptly shut those down because an error was triggered by someone else.

The better approach is send an error response to the request that triggered the error, while letting the others finish in their normal time, and stop listening for new requests in that worker.

In this way, domain usage goes hand-in-hand with the cluster module, since the master process can fork a new worker when a worker encounters an error. For node programs that scale to multiple machines, the terminating proxy or service registry can take note of the failure, and react accordingly.

The Node.js documentation on domains is available at the following URL:

http://nodejs.org/api/domain.html

Detect outdated versions of your Node.js dependencies

Alexis Kinsella — Wed, 09 Jul 2014 09:00:22 GMT

The Node.js ecosystem is not only very young but also very dynamic. The libraries you use tend to release new versions quickly. To spare you from constantly searching for the latest versions in order to update your package.json file, npm provides the npm-outdated tool, which analyzes your dependencies and tells you which ones are out of date.

npm-outdated

The npm-outdated tool is very simple to use. Call it as follows:

npm outdated --depth=0

It lists each outdated dependency together with its versions.

Older versions of the tool do not produce colorized output, so upgrading is worthwhile. The npm version used here is 1.4.9.

npm-outdated analyzes your regular dependencies and your development dependencies alike, without distinction.

The output only shows dependencies whose version is outdated; up-to-date dependencies are not listed.

Three versions are reported: Current, Wanted and Latest. They are, respectively, the installed version, the latest version matching the version pattern declared for the dependency in package.json, and the latest version available for the library.

The depth option

The --depth=0 parameter restricts the analysis to direct dependencies, ignoring the dependencies pulled in by the libraries that your direct dependencies themselves pull in.

With --depth=2, indirect dependencies start to appear in the tool's output.

The json option

The --json parameter produces JSON output. This option is particularly handy for feeding the information into build reports, for example, or into other tools.

Running the following command line:

npm outdated --depth=0 --json

You will get the following output:

{
  "coffee-script": {
    "current": "1.6.3",
    "wanted": "1.6.3",
    "latest": "1.7.1",
    "location": "node_modules/coffee-script"
  },
  "passport-local": {
    "current": "0.1.6",
    "wanted": "0.1.6",
    "latest": "1.0.0",
    "location": "node_modules/passport-local"
  },
  "uglify-js": {
    "current": "2.4.13",
    "wanted": "2.4.14",
    "latest": "2.4.14",
    "location": "node_modules/uglify-js"
  }, ...
}
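
Because the JSON form is machine-readable, a build script could, for instance, separate the packages that can be updated within the range declared in package.json from those whose declared range blocks the latest release. The following is a minimal sketch, assuming the report has already been parsed into an object; classifyOutdated is a hypothetical helper, not part of npm.

```javascript
// Sample shaped like the `npm outdated --json` output shown above.
const outdated = {
    "coffee-script": { current: "1.6.3", wanted: "1.6.3", latest: "1.7.1" },
    "passport-local": { current: "0.1.6", wanted: "0.1.6", latest: "1.0.0" },
    "uglify-js": { current: "2.4.13", wanted: "2.4.14", latest: "2.4.14" }
};

// Split packages into those that can be updated within the range declared in
// package.json and those that need the declared range to be changed first.
function classifyOutdated(report) {
    const result = { withinRange: [], needsRangeChange: [] };
    for (const [name, info] of Object.entries(report)) {
        if (info.current !== info.wanted) {
            result.withinRange.push(name);
        }
        if (info.wanted !== info.latest) {
            result.needsRangeChange.push(name);
        }
    }
    return result;
}

console.log(classifyOutdated(outdated));
```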

npm-update

Now that you know the latest available versions, you may want to update some of them. To do so, you can use the npm update command.

To update the request library, you would run the following command:

npm update request

Badges for your GitHub repository

The David project generates badges showing whether your library versions are up to date or outdated.

This project is particularly interesting: not only does it generate reports for your project without you having to lift a finger, it also generates badges that you can display on your project pages to show the status of your dependency versions.

For example, to find out whether the gtfs-playground project's dependencies are up to date, visit the following page: https://david-dm.org/akinsella/gtfs-playground

The tool works exclusively with GitHub repositories. To build a report for your project, fill in your organization and repository name in the following URL, then open it:

https://david-dm.org//

Similarly, to get the badge for your project, build the img tag as follows:

https://david-dm.org//.

The badge is available in both PNG and SVG formats.

Conclusion

The Node.js ecosystem evolves quickly, so libraries regularly ship new features and bug fixes. Do not hesitate to update them.

Be careful, however, not to rush into installing a library version that is no longer compatible with your code, or one that is buggy. Remember to run your tests to make sure no regression affects your code base.

Transform your Node.js code with the Bluebird promise module

Alexis Kinsella — Fri, 04 Jul 2014 09:00:44 GMT

When promises come up in the Node.js ecosystem, the Q library immediately comes to mind. However, many promise modules exist, each offering something different. In particular, the bluebird module stands out thanks to some very interesting features such as "promisification".

Promisification

Node.js core modules are callback-based. To read a file asynchronously, you call the readFile function of the fs module and process the response in the callback passed as the last parameter of the call:

fs = require "fs"

fs.readFile "file.json", (err, val) ->
    if err
        console.error "unable to read file"
    else
        # Only parse the content when the read succeeded
        try
            val = JSON.parse(val)
            console.log val.success
        catch e
            console.error "invalid json in file"

Bluebird lets you turn the previous code into the following:

fs.readFileAsync("file.json").then(JSON.parse).then (val) ->
    console.log val.success
.catch SyntaxError, (e) ->
    console.error "invalid json in file"
.catch (e) ->
    console.error "unable to read file"

This transformation is made possible by promisifying the fs module with the promisifyAll function, which turns all the exposed functions into functions that return promises:

Promise = require "bluebird"
fs = require "fs"
Promise.promisifyAll fs

fs.readFileAsync("file.js", "utf8").then(...)

In all likelihood, each function of the module is wrapped in a proxy exposed under the same name with an Async suffix, whose signature drops the final callback and returns a promise instead.

Note that chaining catch functions on the promise makes it possible to handle errors differently depending on their type. Here, errors of type SyntaxError are handled differently from errors of other types.
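
To make the idea of promisification concrete, here is a minimal sketch, in JavaScript, of how a callback-style function could be wrapped so that it returns a promise. It is not Bluebird's implementation; promisifySketch, store and readSetting are hypothetical names, and readSetting merely stands in for a function such as fs.readFile.

```javascript
// Wrap a Node-style function (last argument is an (err, value) callback)
// so that it returns a promise instead.
function promisifySketch(fn, receiver) {
    return function (...args) {
        return new Promise((resolve, reject) => {
            fn.call(receiver, ...args, (err, value) => {
                if (err) {
                    reject(err);
                } else {
                    resolve(value);
                }
            });
        });
    };
}

// A hypothetical callback-based function standing in for fs.readFile.
const store = {
    settings: { port: 9000 },
    readSetting(key, callback) {
        if (key in this.settings) {
            callback(null, this.settings[key]);
        } else {
            callback(new Error(`unknown setting: ${key}`));
        }
    }
};

const readSettingAsync = promisifySketch(store.readSetting, store);

readSettingAsync("port").then((value) => console.log(value));
readSettingAsync("host").catch((err) => console.error(err.message));
```

The receiver argument matters because readSetting relies on this: without it, the wrapped function would be called detached from its object.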

promisify

It is also possible to promisify a single function with the promisify function:

redisGet = Promise.promisify(redisClient.get, redisClient)
redisGet('foo').then () ->
    #...

There is one pitfall, though: the function expects two parameters. The first is a reference to the function to promisify, and the second is the object the function is attached to, so that this is bound correctly when the function is called.

nodeify

The nodeify function is also very interesting: it registers a callback on a bluebird promise and calls it when the promise is settled:

getDataFor = (input, callback) ->
    dataFromDataBase(input).nodeify(callback)

This is particularly useful because it lets you build APIs that can be used by callback-based code as well as by promise-based code.

If the callback is provided, it is called. Otherwise, the caller simply uses the promise returned by the function to obtain and process the result of the call.

Example using the promise mechanism:

getDataFor("me").then (dataForMe) ->
    console.log dataForMe

The same example using the callback mechanism:

getDataFor "me", (err, dataForMe) ->
    if err
        console.error err
    else
        console.log dataForMe
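
The dual callback/promise behavior can be sketched in plain JavaScript as follows. This illustrates the idea with native promises and is not Bluebird's nodeify; nodeifySketch and the dataFromDataBase stub are hypothetical stand-ins.

```javascript
// Attach an optional Node-style callback to a promise.
// Without a callback, the promise is simply returned to the caller.
function nodeifySketch(promise, callback) {
    if (typeof callback !== "function") {
        return promise;
    }
    promise.then(
        (value) => callback(null, value),
        (err) => callback(err)
    );
    return promise;
}

// Hypothetical data source standing in for a database call.
function dataFromDataBase(input) {
    return Promise.resolve({ owner: input });
}

function getDataFor(input, callback) {
    return nodeifySketch(dataFromDataBase(input), callback);
}

// Promise style
getDataFor("me").then((dataForMe) => console.log(dataForMe));

// Callback style
getDataFor("me", (err, dataForMe) => {
    if (err) {
        console.error(err);
    } else {
        console.log(dataForMe);
    }
});
```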

spread

Normally, the following code yields the array [1, 2, 3] as its result.

Promise.resolve([1,2,3]).nodeify (err, result) ->
    # err == null
    # result: [1,2,3]

However, the {spread: true} option passed to nodeify spreads the result values across the arguments of the callback function:

Promise.resolve([1,2,3]).nodeify (err, a, b, c) ->
    # err == null
    # a == 1
    # b == 2
    # c == 3
, {spread: true}

Conclusion

The bluebird library is rich in interesting functions. You can find them on the documentation page of the GitHub project:

Link: https://github.com/petkaantonov/bluebird/blob/master/API.md

Lock the versions of your Node.js dependencies

Alexis Kinsella — Mon, 30 Jun 2014 08:00:48 GMT

Node.js comes with a very effective and unavoidable dependency manager: npm.

Based on the dependency information declared in the package.json file, it fetches the declared dependencies and installs them in your project's node_modules folder when you run the command:

npm install

Why?

Unlike the mechanism offered by Maven in the Java world, Node.js relies on hierarchical dependencies. In other words, npm fetches and installs the associated libraries at each level: the application and the dependencies themselves.

For example, if your project and the libraries it depends on all use the mkdirp library, npm downloads and installs mkdirp both in your project's node_modules folder and in the node_modules folder of the library your project depends on.

For each library you depend on, you must declare a version selection pattern that tells npm which version of the dependency to download. The available patterns are varied, ranging from a wildcard to an exact version.

What are the risks?

Very quickly, you will run into problems caused by dependency versions that change under you.

At best, this will prevent your applications from running correctly. At worst, it will introduce subtle bugs that are very hard to detect or fix, putting your business or the perceived quality of your software at risk.

What solutions? What tools?

One way to guard against this problem is to lock the versions of your dependencies by declaring more restrictive version patterns, or even fully fixed versions.
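
As a first step toward stricter patterns, a small script can list the dependencies that are not pinned to an exact version. The following is a minimal sketch: findLooseVersions is a hypothetical helper, the check is deliberately simplistic, and the dependency list is made up for illustration.

```javascript
// Return the dependencies whose declared version is not an exact x.y.z version.
// Deliberately simplistic: anything that is not three dot-separated numbers
// (ranges, wildcards, tags, URLs) is reported as loose.
function findLooseVersions(dependencies) {
    const exactVersion = /^\d+\.\d+\.\d+$/;
    return Object.keys(dependencies).filter(
        (name) => !exactVersion.test(dependencies[name])
    );
}

// Hypothetical "dependencies" section of a package.json file.
const dependencies = {
    "express": "3.4.8",
    "mkdirp": "*",
    "request": "~2.34.0",
    "coffee-script": ">=1.6.0"
};

console.log(findLooseVersions(dependencies));
```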

That does not get you out of trouble, though. However strictly you pin the versions of your dependencies, they in turn rely on other dependencies whose authors may not apply the versioning rules that suit you.

A given library may, for instance, declare one of its dependencies with a wildcard. As a result, you will eventually pull in a version that is either incompatible with your code or simply buggy.

Ideally, you would lock the whole hierarchy of dependency versions and be able to reinstall those dependencies repeatedly in the selected versions.

The good news is that solutions exist to meet this need, including the lockdown and npm-shrinkwrap tools.

lockdown

The lockdown module locks the versions of your project's dependencies to ensure that the code you develop relies on the same dependency versions in your IDE, during your test phases and in production.

With lockdown you keep using the npm install command while being sure to get the same code every time it runs, without having to copy your dependencies' code into your source control or maintain a private npm repository.

As explained earlier, even if you specify the exact version of your dependencies in your project's package.json file, you remain vulnerable to the sudden appearance of an incompatibility with one of your dependencies.

For example, if your project depends on a package at a specific version which itself depends on another package declared with a version range, the version of that indirect dependency may change on a future run of npm install.

Unfortunately, this is not the only source of problems. Other actions by a library's author can accidentally break your application's code:

Publishing a new library version that no longer supports the version of Node.js you use
Introducing a bug into code that previously worked well
…

Usage

1. Install a dependency in your project. For example, from the command line:

npm install @ --save

2. Generate the lockdown.json file by running the lockdown-relock command:

node_modules/.bin/lockdown-relock

3. Then add the newly created file to your source control.

Installing your dependencies from the lockdown.json file

Once the lockdown.json file has been generated, simply call npm install as usual; it installs all the dependencies in the expected versions.

Strengths

Lockdown aims to guarantee that you use identical source code in development and in production. That is why, in addition to storing the versions of the dependencies used, it also stores checksums of the code. It can therefore detect that the source code of a given version has been modified, and alert you to the problem.

Another interesting point: the project is maintained by Mozilla, which is reassuring as to the seriousness and longevity of the tool.
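
The checksum principle described above can be sketched with the Node.js crypto module. This is only an illustration of the idea: it does not reproduce lockdown's actual file format or verification code, the choice of SHA-1 is an assumption, and lockEntry and the package contents are made up.

```javascript
const crypto = require("crypto");

// Compute the SHA-1 checksum of a package's content.
function checksum(content) {
    return crypto.createHash("sha1").update(content).digest("hex");
}

// Compare the content against the checksum recorded when versions were locked.
function verify(content, expectedChecksum) {
    return checksum(content) === expectedChecksum;
}

// Hypothetical package content and lock entry.
const original = Buffer.from("module.exports = 42;");
const lockEntry = { version: "1.0.0", sha1: checksum(original) };

// Same version number, same content: accepted.
console.log(verify(original, lockEntry.sha1));
// Same version number, modified content: rejected.
console.log(verify(Buffer.from("module.exports = 43;"), lockEntry.sha1));
```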

npm-shrinkwrap

Like lockdown, the npm-shrinkwrap command freezes the dependency versions of your application.

There is nothing to install, however, since the command ships with npm. As npm comes with the Node.js installation, you do not have to lift a finger to have the tool at hand.

The file that stores the version information is called npm-shrinkwrap.json.

Usage

Using npm-shrinkwrap is very similar to using lockdown: run npm install to install your dependencies, then run npm shrinkwrap to generate the version file.

Checksum management

Unlike lockdown, the npm-shrinkwrap command does not handle checksums. However, alternatives exist, such as the npm-seal package, which complements npm-shrinkwrap by providing the missing feature.

Unlike npm-shrinkwrap, npm-seal is a third-party utility that must be installed separately with the following command:

npm install seal -g

Strengths

As we have seen, the command is built into npm. In addition, the tool is also maintained by a company with a reputation for seriousness: Uber.

Good to know

Although lockdown and npm-shrinkwrap offer different solutions, the two tools can be combined without any problem.

Enable JSONP support with Express

Alexis Kinsella — Fri, 27 Jun 2014 08:00:20 GMT

If your web services are meant to be called from other domains in a web browser, you will need to enable JSON Padding (JSONP) support for your JSON REST services.

JSON Padding support varies widely from one server to another. On the Node.js side, with Express, the feature is very well supported but not enabled by default, so you have to enable it in your server configuration.

Unfortunately, finding out how takes a bit of digging. To save you some time, here is how to configure your server:

app.set 'jsonp callback name', 'callback'

The 'jsonp callback name' setting specifies the name of the query string parameter that carries the callback wrapping the returned JSON. Here, the parameter is named 'callback'.

A call without a callback gives the following result:

akinsella@~$ curl http://localhost:8000/api/v1/conferences
[
  {
    "id": 12,
    "backgroundUrl": "http://blog.xebia.fr/images/devoxxuk-2014-background.png",
    "logoUrl": "http://blog.xebia.fr/images/devoxxuk-2014-logo.png",
    "iconUrl": "http://blog.xebia.fr/images/devoxxuk-2014-icon.png",
    "from": "2014-06-12",
    "name": "DevoxxUK 2014",
    "description": "The Devoxx UK annual event.",
    "location": "Business Design Center",
    "enabled": true,
    "to": "2014-06-13"
  }, ...]

The headers specify a Content-Type of 'application/json':

akinsella@~$ curl -I http://localhost:8000/api/v1/conferences              
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Content-Length: 4397
Date: Sat, 14 Jun 2014 14:11:36 GMT
Connection: keep-alive

Whereas when resources are called with the 'callback' query string parameter, the server generates responses with a Content-Type of 'text/javascript':

akinsella@~$ curl -I http://localhost:8000/api/v1/conferences\?callback\=cb
HTTP/1.1 200 OK
Content-Type: text/javascript; charset=utf-8
Content-Length: 4430
Date: Sat, 14 Jun 2014 13:56:03 GMT
Connection: keep-alive

The response body is as follows:

akinsella@~$ curl http://localhost:8000/api/v1/conferences\?callback\=cb
typeof cb === 'function' && cb([
  {
    "id": 12,
    "backgroundUrl": "http://blog.xebia.fr/images/devoxxuk-2014-background.png",
    "logoUrl": "http://blog.xebia.fr/images/devoxxuk-2014-logo.png",
    "iconUrl": "http://blog.xebia.fr/images/devoxxuk-2014-icon.png",
    "from": "2014-06-12",
    "name": "DevoxxUK 2014",
    "description": "The Devoxx UK annual event.",
    "location": "Business Design Center",
    "enabled": true,
    "to": "2014-06-13"
  }, ...]);

The icing on the cake is that the result is generated with care: the output follows coding best practices that help avoid certain XSS attacks associated with JSON Padding.
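
To illustrate the kind of precautions involved, here is a minimal sketch of a function that builds a JSONP body in the shape shown above. It is not Express's actual code: jsonpBody is a hypothetical name, and the character whitelist applied to the callback name is an assumption chosen for the example.

```javascript
// Wrap a value in a JSONP response body.
// The callback name comes from the query string, so it is restricted to
// characters that are valid in a property path before being emitted.
function jsonpBody(callbackName, value) {
    const safeName = String(callbackName).replace(/[^\[\]\w$.]/g, "");
    // U+2028 and U+2029 are valid in JSON but could break a script,
    // so they are escaped explicitly.
    const json = JSON.stringify(value)
        .replace(/\u2028/g, "\\u2028")
        .replace(/\u2029/g, "\\u2029");
    return `typeof ${safeName} === 'function' && ${safeName}(${json});`;
}

console.log(jsonpBody("cb", [{ id: 12, name: "DevoxxUK 2014" }]));
```

The typeof guard means the script does nothing when the page has not defined the callback, and stripping unexpected characters prevents a crafted callback parameter from injecting arbitrary code.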

Disable the 'x-powered-by' response header with Express

Alexis Kinsella — Wed, 25 Jun 2014 09:00:21 GMT

Revealing the type of server that runs your web services can be considered a security concern. It is therefore preferable not to send this information in the HTTP response headers with Express.

Example of an HTTP response with the 'X-Powered-By' header enabled:

akinsella@~$ curl -I http://localhost:8000/api/v1/conferences
HTTP/1.1 200 OK
X-Powered-By: Express
Content-Type: application/json; charset=utf-8
Content-Length: 4397
Date: Sat, 14 Jun 2014 22:43:15 GMT
Connection: keep-alive

To disable the header, simply declare the following option in your application's configuration code:

app.disable "x-powered-by"

HTTP clients connected to your services will then no longer receive this information:

akinsella@~$ curl -I http://localhost:8000/api/v1/conferences
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Content-Length: 4397
Date: Sat, 14 Jun 2014 22:43:15 GMT
Connection: keep-alive

Create a logging middleware for incoming HTTP requests with Express

Alexis Kinsella — Mon, 23 Jun 2014 08:00:50 GMT

Want to log incoming HTTP connections with Node.js and Express? Nothing could be simpler! Just declare a middleware as follows:

util = require 'util'

module.exports = (req, res, next) ->
    console.log  """---------------------------------------------------------
                    Http Request - Pid process: [#{process.pid}]
                    Http Request - Url: #{req.url}
                    Http Request - Query: #{util.inspect(req.query)}
                    Http Request - Method: #{req.method}
                    Http Request - Headers: #{util.inspect(req.headers)}
                    Http Request - Body: #{util.inspect(req.body)}
                    ---------------------------------------------------------"""

    next()

As you can see, no external module is needed. Then simply plug your new middleware into your Express server's configuration code, as follows:

express = require 'express'
requestLogger = require './lib/requestLogger'

app = express()

app.configure ->
    console.log "Environment: #{app.get('env')}"
    app.set 'port', 8000

    ...

    app.use requestLogger

    ...

    app.use app.router

app.listen app.get('port')

The log produced for an HTTP request looks like this:

---------------------------------------------------------
Http Request - Pid process: [26074]
Http Request - Url: /
Http Request - Query: {}
Http Request - Method: GET
Http Request - Headers: { host: 'localhost:9000',
  connection: 'keep-alive',
  accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
  'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36',
  'accept-encoding': 'gzip,deflate,sdch',
  'accept-language': 'fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4',
  cookie: '...' }
Http Request - Body: undefined
---------------------------------------------------------