
S3 storage


What is S3 storage?

S3 (Simple Storage Service) is an object storage service. It stores data as objects inside buckets, rather than as files in directories. Each object consists of:

  • Data (the file itself),
  • Metadata (descriptive tags), and
  • A unique key (the object's identifier within a bucket).
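
For example, in s3://bbg/data/example/file.vcf (a path used in the examples below), bbg is the bucket and data/example/file.vcf is the object's key.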

Unlike traditional file systems such as our standard partitions (NetApp), S3 is not a native POSIX filesystem: it is not designed for direct file editing or low-latency access. While it can be mounted using tools like rclone mount or s3fs, this is more of a workaround and may not behave exactly like a regular shared folder. Instead, most modern tools and workflows, such as Nextflow, Snakemake, and Python libraries, are increasingly adapted to interact with S3 natively, which offers better performance and scalability for data-intensive pipelines.

S3 offers several advantages: it scales well, making it ideal for storing large volumes of data (hundreds of GB or more); it supports fine-grained access control; and it integrates well with cloud-native and container-based pipelines. In our case, S3 is used as static storage for raw data and final results, not as a working directory where files are constantly modified.

We use MinIO, a local S3-compatible server (instead of AWS S3), to manage this storage on-premises.

How to access

Note

S3 is only available from the new IRB cluster

MinIO web access to S3, which allows browsing buckets and downloading small files (e.g., metadata, images), used to be available to all BBG members. It is now only available to users on paid plans and is no longer accessible to IRB members.

From now on, access to S3 with write permissions is only possible via the terminal (e.g., to navigate, run Python scripts, or execute Nextflow pipelines outside Seqera). To use terminal access, you need to generate S3 credentials.

If you want to navigate S3 buckets — whether to check the file structure or read PDFs, Excel files, or text files — you can do so using Open OnDemand or by mounting them. You can create the credentials directly from Open OnDemand or with the local tool. Check the corresponding section for details.

If you only want to use S3 files in your Nextflow pipelines through the Seqera Platform, no further setup is needed. Credentials are already configured in the pipelines; you only need to export them.

Terminal from IRB cluster

To use S3 via the terminal, once on the IRB cluster:

# Load conda environment
$ module load anaconda3  # or pin the version: ml load anaconda3/2023.09-0-yjzjr4h
$ conda activate s3-minio

# Generate credentials
# -u -> your LDAP username
# -r -> creates the ~/.config/rclone/rclone.conf file
# -d -> duration in days for credentials validity
$ python3 /apps/scripts/irb-storage-public-scripts/minio/minio-sts-credentials-request.py -u mgrau -r -d 365

# test
$ rclone lsf irb-minio://bbg

This script (minio-sts-credentials-request.py) generates the config file used by rclone:

$ cat ~/.config/rclone/rclone.conf

[irb-minio]
type = s3
provider = Minio
endpoint = http://irbminio.irbbarcelona.pcb.ub.es:9000
acl = bucket-owner-full-control
env_auth = false
access_key_id = ***
secret_access_key = ***
session_token = ***
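
Beyond listing files, rclone can move data in and out of the buckets. A couple of illustrative commands (the destination path is hypothetical):

# List all buckets on the remote
$ rclone lsd irb-minio:
# Copy a local folder into a bucket
$ rclone copy ./results irb-minio:bbg/results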

If you want to use another S3 client, you need to create its credentials file. For example, for aws-cli (replace the keys/token with the values from your rclone.conf):

$ cat ~/.aws/credentials

[default]
aws_access_key_id = ***
aws_secret_access_key = ***
aws_session_token = ***
endpoint_url = http://irbminio.irbbarcelona.pcb.ub.es:9000
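
You can then use aws-cli as usual. If your aws-cli version does not pick up endpoint_url from this file, pass it explicitly on the command line:

$ aws s3 ls s3://bbg/ --endpoint-url http://irbminio.irbbarcelona.pcb.ub.es:9000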

As mentioned earlier, you can mount an S3 bucket as a POSIX-like partition (similar to what MacGyver does):

Warning

This allows you to browse S3 buckets as if they were part of the local file system, although it is typically read-only and not recommended for high-performance or heavy I/O use cases.

# Before mounting, the directory is empty
$ ls /home/mgrau/s3/bbg-scratch/
# Mount
$ rclone mount irb-minio:bbg-scratch /home/mgrau/s3/bbg-scratch --vfs-cache-mode off --read-only &
[1] 636085
$ ls /home/mgrau/s3/bbg-scratch/
work
# Unmount
$ fusermount -u /home/mgrau/s3/bbg-scratch
[1]+  Done                    rclone mount irb-minio:bbg-scratch /home/mgrau/s3/bbg-scratch --vfs-cache-mode off --read-only
$ ls /home/mgrau/s3/bbg-scratch/
$

Once your AWS credentials are generated, you can use stu as a TUI explorer, without needing to mount the S3 bucket:

$ spack load stu
$ stu

stu TUI demo

Open OnDemand

In addition to the command line on the IRB cluster, you can also access S3 via Open OnDemand. The Open OnDemand S3 interface is read-only: you can browse and download files, but you cannot modify existing objects or upload new ones from this interface.

  1. Go to the Open OnDemand Dashboard and log in with your LDAP credentials.

  2. Verify that your credentials are valid, or create new ones if they have expired.

    S3 Open OnDemand credentials screenshot

  3. Once the credentials are valid, click on the "Home Directory System" section.

    Open OnDemand Home Directory System app screenshot

  4. Select the S3 storage on the left panel and navigate through the folders.
    You can read PDFs and text files directly in the browser.
    To view Excel or HTML files, you need to download them first.

    Open OnDemand S3 storage browser screenshot

Mount a bucket on your local machine

It is also possible to mount the S3 storage as a local partition (as we do with MacGyver and the workspace/). This is a read-only option.

Below are the instructions for Ubuntu. For Windows and Mac, see the official IRB documentation.

  1. Create a conda environment with rclone

    conda create -n s3 rclone -c conda-forge -y
    conda activate s3
    
  2. Download and run the app from IT

    wget https://github.com/its-irb/irb-storage-public-scripts/releases/latest/download/minio-rclone-copy-GUI-linux
    chmod +x minio-rclone-copy-GUI-linux
    ./minio-rclone-copy-GUI-linux
    
  3. Mount:

    Log in using your LDAP credentials.

    Select the S3 storage:

S3 selection screen

If your MinIO credentials are missing or expired, the app will offer to renew them.

Minio credentials renewal prompt

In the destination path, enter the S3 bucket name and click "Mount destination folder":

Mount destination folder dialog

And that's it. Now you can access the bucket from the file browser and the terminal:

S3 mounted in file browser

S3 mounted in terminal

This tool can also be used to move data between the NetApp and S3, but this functionality is still in a testing phase and should be used with caution.

How to use it

Seqera and Nextflow

As mentioned, you don't need to include any credentials explicitly in Seqera to run a job. When accessing S3 data from a pipeline, remember to select the secrets and include the relevant information in the pipeline config:

S3 secrets configuration screenshot

If running Nextflow from a terminal, you can add the credentials to your nextflow.config.
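
A minimal sketch of what that could look like, assuming the same MinIO endpoint as above (the key values come from your generated credentials; s3PathStyleAccess is typically required for MinIO-style endpoints):

// nextflow.config
aws {
    accessKey = '***'
    secretKey = '***'
    client {
        endpoint = 'http://irbminio.irbbarcelona.pcb.ub.es:9000'
        s3PathStyleAccess = true
    }
}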

Python

To access an S3 file from a Python script, several libraries are available: boto3, s3fs, and dask. All three automatically use the credentials in ~/.aws/credentials if the file exists.

Note

AI recommends... Dask! Compared to s3fs and boto3, it is much faster and optimized for parallel and distributed computing, enabling efficient processing of files hundreds of gigabytes in size. While boto3 is a low-level client and s3fs provides filesystem-like access, Dask builds on them to offer high-level, scalable data workflows that can handle massive datasets seamlessly.

boto3

import boto3
import pandas as pd
from io import StringIO

# Initialize the S3 client, pointing at the MinIO endpoint
# (credentials are read automatically from ~/.aws/credentials)
s3 = boto3.client('s3', endpoint_url='http://irbminio.irbbarcelona.pcb.ub.es:9000')

# Define your bucket and file key
bucket_name = 'bbg'
file_key = 'data/example/file.vcf'

# Read the file object directly, without downloading
response = s3.get_object(Bucket=bucket_name, Key=file_key)
content = response['Body'].read().decode('utf-8')

# Keep the VCF header line (#CHROM ...) and records, skipping ## metadata
lines = content.strip().split('\n')
vcf_data = [line for line in lines if not line.startswith('##')]

# Load the VCF data into a DataFrame (the #CHROM line becomes the header)
df = pd.read_csv(StringIO('\n'.join(vcf_data)), sep='\t')

# Display the first few rows
print(df.head())
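
If you prefer saving the object to disk instead of streaming it into memory, boto3 also provides s3.download_file(bucket_name, file_key, 'file.vcf').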

s3fs

import s3fs
import pandas as pd
from io import StringIO

# Initialize the S3 file system, pointing at the MinIO endpoint
fs = s3fs.S3FileSystem(client_kwargs={'endpoint_url': 'http://irbminio.irbbarcelona.pcb.ub.es:9000'})

# Read the VCF file, keeping the header line (#CHROM ...) and records
with fs.open("bbg/data/example/file.vcf", 'r') as f:
    vcf_data = [line for line in f if not line.startswith('##')]

# Convert the data to a DataFrame (the #CHROM line becomes the header)
df = pd.read_csv(StringIO(''.join(vcf_data)), sep='\t')

# Display the DataFrame
print(df.head())
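
Since s3fs exposes a filesystem-like API, you can also list a bucket with fs.ls('bbg') or download objects with fs.get().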

dask

import dask.dataframe as dd
import s3fs  # Required backend for s3:// paths

# Read the file using Dask; storage_options points s3fs at the MinIO endpoint
df = dd.read_csv(
    "s3://bbg/data/example/file.vcf",
    sep='\t',
    comment='#',       # Skips ## metadata lines, but also drops the '#CHROM' header line...
    header=None,       # ...so read without a header row
    blocksize='16MB',  # Adjust block size as needed
    dtype=str,         # Ensures flexible data type handling
    storage_options={'client_kwargs': {'endpoint_url': 'http://irbminio.irbbarcelona.pcb.ub.es:9000'}},
)

# Display the first few rows
print(df.head())
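
Keep in mind that Dask is lazy: most operations only build a task graph, and actual computation happens when you call .compute() (methods like head() trigger it implicitly).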