Introduction to AWS Boto in Python
Intro to AWS and Boto3
AWS provides storage, compute, and notification services that we can leverage in data projects. AWS services are granular, meaning they can work together or on their own.
To interact with AWS services using Python, we can use the Boto3 library:
import boto3
s3 = boto3.client(
's3',
region_name='us-east-1',
aws_access_key_id=AWS_KEY_ID,
aws_secret_access_key=AWS_SECRET)
response = s3.list_buckets()
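The response is a dictionary; for example, the bucket names can be read from its 'Buckets' list (a small sketch using the client created above):
# Print the name of every bucket in the account
for bucket in response['Buckets']:
    print(bucket['Name'])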
Creating an IAM User
IAM - Identity and Access Management
- IAM sub-users allow us to control access to AWS services within our account.
- Authenticated with an access key/secret pair.
To create an IAM user, we can search for IAM in the AWS Management Console. Selecting Users followed by Add User allows us to input a user name and select Programmatic Access, so that we can use an access key/secret to connect via Boto3. We can select 'Attach existing policies directly' and search for the policy we wish to attach, e.g. AmazonS3FullAccess. We will then be provided with an Access key ID and a Secret access key that we can use to programmatically interact with AWS.
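To avoid hard-coding these credentials in scripts, one option is to read them from environment variables. A minimal sketch follows; the variable names AWS_KEY_ID and AWS_SECRET are this document's own, and the environment variable names are illustrative:
import os
import boto3

# Read the key/secret from environment variables rather than hard-coding them
AWS_KEY_ID = os.environ['AWS_KEY_ID']
AWS_SECRET = os.environ['AWS_SECRET']

s3 = boto3.client('s3',
    region_name='us-east-1',
    aws_access_key_id=AWS_KEY_ID,
    aws_secret_access_key=AWS_SECRET)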
AWS Services
S3 (Simple Storage Service) - Lets us store files in the cloud.
SNS (Simple Notification Service) - Lets us send emails and texts to alert subscribers based on events and conditions in our data pipelines.
Comprehend - Performs sentiment analysis on blocks of text.
Rekognition - Performs computer vision tasks, such as detecting objects in images and extracting text from images.
Diving into buckets
S3 allows us to place files in the cloud, accessible from anywhere by a URL.
Buckets - Analogous to desktop folders:
- Have their own permission policy
- Can provide website storage (static website hosting)
- Generate logs about activity
Objects - Analogous to files
What can we do with buckets using Boto3?
- Creating a Bucket
import boto3
# Create boto3 client
s3 = boto3.client(
    's3',
    region_name='us-east-1',
    aws_access_key_id=AWS_KEY_ID,
    aws_secret_access_key=AWS_SECRET)
# Create bucket
bucket = s3.create_bucket(Bucket='gid-requests')
- List Buckets
import boto3
s3 = boto3.client(
    's3',
    region_name='us-east-1',
    aws_access_key_id=AWS_KEY_ID,
    aws_secret_access_key=AWS_SECRET)
# List buckets as dict
bucket_response = s3.list_buckets()
- Delete Bucket
import boto3
s3 = boto3.client(
    's3',
    region_name='us-east-1',
    aws_access_key_id=AWS_KEY_ID,
    aws_secret_access_key=AWS_SECRET)
# Delete bucket (the bucket must be empty)
response = s3.delete_bucket(Bucket='gid-requests')
Uploading and Retrieving Files
An object can be anything: an image, a video file, a CSV, a log file, etc.
Objects and buckets are analogous to files and folders on a desktop.
| Bucket | Object |
|---|---|
| A bucket has a name | An object has a key |
| The name is a string | The key is the full path from the bucket root |
| The name is unique across all of S3 | The key is unique within its bucket |
| Contains many objects | Can only be in one parent bucket |
Upload files
s3.upload_file(
Filename='gid_requests_2019_01_01.csv', # local file path
Bucket='gid-requests', # name of bucket we're uploading to
Key='gid_requests_2019_01_01.csv') # name in s3
List objects in a bucket
response = s3.list_objects(
Bucket='gid-requests',
MaxKeys=2, # limit response to n objects
Prefix='gid_requests_2019_') # limit to objects starting with prefix
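list_objects() returns the matching objects under the 'Contents' key of the response; a small sketch of inspecting them:
# Print the key and size (in bytes) of each returned object
# Note: 'Contents' is only present if at least one object matched
for obj in response['Contents']:
    print(obj['Key'], obj['Size'])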
Get object metadata
response = s3.head_object(
    Bucket='gid-requests',
    Key='gid_requests_2018_12_30.csv')
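The head_object() response contains metadata such as the object's size and last-modified time:
# Inspect object metadata from the response
print(response['ContentLength'])   # size in bytes
print(response['LastModified'])    # datetime of last modification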
Download file
s3.download_file(
Filename='gid_requests_downed.csv', # local file path to download to
Bucket='gid-requests',
Key='gid_requests_2018_12_30.csv')
Delete objects
s3.delete_object(
Bucket='gid-requests',
Key='gid_requests_2018_12_30.csv')
Sharing Files Securely
AWS denies permissions by default, so we must explicitly grant access before objects and buckets can be accessed.
AWS Permissions Systems:
- IAM - Attach a policy to specific users to control access to services, buckets and objects.
- Bucket policy - Control buckets and objects within.
- ACL (Access Control Lists) - Set permissions on specific objects within a bucket.
- Presigned URL - Temporary access to an object.
ACLs
Entities attached to objects in S3. Common types include 'public-read' and 'private'.
# By default, the ACL is 'private' when left unspecified
s3.upload_file(
Filename='potholes.csv', Bucket='gid-requests', Key='potholes.csv')
# Set ACL to 'public-read'
s3.put_object_acl(
Bucket='gid-requests', Key='potholes.csv', ACL='public-read')
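To check which grants an object currently has, we can retrieve its ACL (a small sketch using the same bucket and key):
# Retrieve the object's current ACL grants
acl_response = s3.get_object_acl(Bucket='gid-requests', Key='potholes.csv')
for grant in acl_response['Grants']:
    print(grant['Permission'], grant['Grantee'])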
Setting ACLs on upload:
s3.upload_file(
Bucket='gid-requests',
Filename='potholes.csv',
Key='potholes.csv',
    ExtraArgs={'ACL': 'public-read'})
Accessing public objects
Publicly accessible S3 objects can be accessed using the URL template:
https://{bucket}.s3.amazonaws.com/{key}
e.g. https://gid-requests.s3.amazonaws.com/2019/potholes.csv
Generating a public object's URL:
import pandas as pd

url = "https://{}.s3.amazonaws.com/{}".format(
    "gid-requests",
    "2019/potholes.csv")
df = pd.read_csv(url)
Downloading a private file
# Download file
s3.download_file(
Filename='potholes_local.csv',
Bucket='gid-staging',
Key='2019/potholes_private.csv')
# Read from Disk
pd.read_csv('./potholes_local.csv')
In-memory read of a file:
obj = s3.get_object(Bucket='gid-requests', Key='2019/potholes.csv')
pd.read_csv(obj['Body']) # Read StreamingBody object into Pandas
Pre-signed URLs
Grant temporary access to S3 objects, with an expiration time.
# Generate Presigned URL
share_url = s3.generate_presigned_url(
ClientMethod='get_object',
ExpiresIn=3600, # grant access for one hour
    Params={'Bucket': 'gid-requests', 'Key': 'potholes.csv'}
)
pd.read_csv(share_url)
Load multiple files into one DataFrame:
# Create list to hold our DataFrames
df_list = []
# Request the list of CSVs from S3 with prefix; get contents
response = s3.list_objects(
Bucket='gid-requests',
Prefix='2019/')
# Get response contents
request_files = response['Contents']
# Iterate over each object
for file in request_files:
    obj = s3.get_object(Bucket='gid-requests', Key=file['Key'])
    # Read it as a DataFrame
    obj_df = pd.read_csv(obj['Body'])
    # Append the DataFrame to the list
    df_list.append(obj_df)
# Concatenate all the DataFrames in the list
df = pd.concat(df_list)
Sharing files through a website
Converting a DataFrame to HTML:
df.to_html('table_agg.html',
    render_links=True,
    columns=['service_name', 'request_count', 'info_link'],
    border=0)
Upload an HTML file to S3:
s3.upload_file(
Filename='./table_agg.html',
Bucket='datacamp-website',
Key='table.html',
ExtraArgs = {
'ContentType': 'text/html',
'ACL': 'public-read'}
)
Accessing HTML file:
https://{bucket}.s3.amazonaws.com/{key}
https://datacamp-website.s3.amazonaws.com/table.html
Uploading an image file:
s3.upload_file(
Filename='./plot_image.png',
Bucket='datacamp-website',
Key='plot_image.png',
ExtraArgs = {
'ContentType': 'image/png',
'ACL': 'public-read'}
)
IANA Media Types:
A full list of IANA media types can be found at http://www.iana.org/assignments/media-types/media-types.xhtml. Common types include: JSON (application/json), PNG (image/png), PDF (application/pdf), and CSV (text/csv).
Generating an index page:
# List the gid-reports bucket objects starting with 2019/
r = s3.list_objects(Bucket='gid-reports', Prefix='2019/')
# Convert the response contents to DataFrame
objects_df = pd.DataFrame(r['Contents'])
# Create column "Link" that contains website url + key
base_url = 'http://datacamp-website.s3.amazonaws.com/'
objects_df['Link'] = base_url + objects_df['Key']
# Write DataFrame to html
objects_df.to_html('report_listing.html',
columns=['Link', 'LastModified', 'Size'],
render_links=True)
# Upload HTML file to S3
s3.upload_file(
Filename='./report_listing.html',
Bucket='datacamp-website',
Key='index.html',
ExtraArgs = {
'ContentType': 'text/html',
'ACL': 'public-read'}
)
Case Study: Generating a Report Repository
1) Prepare the data
- Download files for the month from the raw data bucket
# Create list to hold our DataFrames
df_list = []
# Request the list of CSVs from S3 with prefix; Get contents
response = s3.list_objects(
Bucket='gid-requests',
Prefix='2019_jan')
# Get response contents
request_files = response['Contents']
- Concatenate them into one DataFrame
# Iterate over each object
for file in request_files:
    obj = s3.get_object(Bucket='gid-requests', Key=file['Key'])
    # Read it as a DataFrame
    obj_df = pd.read_csv(obj['Body'])
    # Append the DataFrame to the list
    df_list.append(obj_df)
# Concatenate all dfs in list
df = pd.concat(df_list)
- Create an aggregated DataFrame (a sketch of this step follows)
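A minimal sketch of the aggregation step, assuming the combined requests DataFrame has a 'service_name' column (the exact columns depend on the raw data):
# Count requests per service to build the aggregated report DataFrame
agg_df = df.groupby('service_name').size().reset_index(name='count')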
2) Create the report
- Write the DataFrame to CSV and HTML
# Write agg_df to a CSV and HTML file with no border
agg_df.to_csv('./jan_final_report.csv')
agg_df.to_html('./jan_final_report.html', border=0)
- Generate a Bokeh plot and save it as HTML (see the sketch below)
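A minimal sketch of this step, assuming the aggregated DataFrame from above; the chart type and filename are illustrative:
from bokeh.plotting import figure
from bokeh.io import output_file, save

# Build a simple bar chart of request counts per service
p = figure(x_range=list(agg_df['service_name']), title='Requests per service')
p.vbar(x=agg_df['service_name'], top=agg_df['count'], width=0.9)

# Save the plot as a standalone HTML file
output_file('./jan_final_chart.html')
save(p)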
3) Upload the report to a shareable website
- Create the gid-reports bucket
- Upload all three files for the month to S3
# Upload Aggregated CSV to S3
s3.upload_file(Filename='./jan_final_report.csv',
Key='2019/jan/final_report.csv',
Bucket='gid-reports',
ExtraArgs = {'ACL': 'public-read'})
# Upload HTML table to S3
s3.upload_file(Filename='./jan_final_report.html',
Key='2019/jan/final_report.html',
Bucket='gid-reports',
ExtraArgs = {
'ContentType': 'text/html',
'ACL': 'public-read'})
# Upload Aggregated Chart to S3
s3.upload_file(Filename='./jan_final_chart.html',
Key='2019/jan/final_chart.html',
Bucket='gid-reports',
ExtraArgs = {
'ContentType': 'text/html',
'ACL': 'public-read'})
- Generate an index.html file that lists all the files
# List the gid-reports bucket objects starting with 2019/
r = s3.list_objects(Bucket='gid-reports', Prefix='2019/')
# Convert the response contents to DataFrame
objects_df = pd.DataFrame(r['Contents'])
# Create a column "Link" that contains website url + key
base_url = "https://gid-reports.s3.amazonaws.com/"
objects_df['Link'] = base_url + objects_df['Key']
# Write DataFrame to html
objects_df.to_html('report_listing.html',
columns=['Link', 'LastModified', 'Size'],
render_links=True)
# Upload the file to gid-reports bucket root
s3.upload_file(Filename='./report_listing.html',
Key='index.html',
Bucket='gid-reports',
ExtraArgs = {
'ContentType': 'text/html',
'ACL': 'public-read'})
- Get website URL
http://gid-reports.s3.amazonaws.com/index.html
SNS Topics
SNS - Simple Notification Service
Publisher -> SNS Topic -> SNS Subscribers (via email or SMS)
Each SNS topic has a unique ARN (Amazon Resource Name), and each subscription has a unique ID.
Creating an SNS Topic
sns = boto3.client('sns',
    region_name='us-east-1',
    aws_access_key_id=AWS_KEY_ID,
    aws_secret_access_key=AWS_SECRET)
response = sns.create_topic(Name='city_alerts')
topic_arn = response['TopicArn']
The create_topic() method is idempotent: it won't recreate a topic that already exists, and simply returns the existing topic's ARN.
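For example, calling it twice with the same name returns the same ARN both times:
# Calling create_topic() again with the same name returns the existing topic's ARN
arn_1 = sns.create_topic(Name='city_alerts')['TopicArn']
arn_2 = sns.create_topic(Name='city_alerts')['TopicArn']
print(arn_1 == arn_2)  # True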
Listing Topics
sns.list_topics()
Deleting Topics
sns.delete_topic(TopicArn='arn:aws:sns:us-east-1:320333787981:city_alerts')
# Get the current list of topics
topics = sns.list_topics()['Topics']
for topic in topics:
    # For each topic, delete it if it is not marked critical
    if "critical" not in topic['TopicArn']:
        sns.delete_topic(TopicArn=topic['TopicArn'])
# Print the list of remaining critical topics
print(sns.list_topics()['Topics'])
SNS Subscriptions
Each subscription has a unique ID, an Endpoint (the phone number or email address the message is sent to), a Status (confirmed or pending confirmation), and a Protocol (email or SMS).
sns = boto3.client('sns',
    region_name='us-east-1',
    aws_access_key_id=AWS_KEY_ID,
    aws_secret_access_key=AWS_SECRET)
response = sns.subscribe(
TopicArn = 'arn:aws:sns:us-east-1:320333787981:city_alerts',
Protocol = 'SMS',
Endpoint = '+13125551123')
Subscriptions are confirmed automatically for SMS, but remain 'Pending Confirmation' for email until the user confirms.
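An email subscription looks the same apart from the protocol and endpoint (the address below is a placeholder):
# Subscribe an email address; it stays 'Pending Confirmation' until the
# recipient clicks the confirmation link that SNS emails to them
response = sns.subscribe(
    TopicArn='arn:aws:sns:us-east-1:320333787981:city_alerts',
    Protocol='email',
    Endpoint='city.alerts@example.com')  # placeholder address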
Listing Subscriptions
sns.list_subscriptions_by_topic(
TopicArn='arn:aws:sns:us-east-1:320333787981:city_alerts')
sns.list_subscriptions()['Subscriptions']
Deleting Subscriptions
sns.unsubscribe(
SubscriptionArn='arn:aws:sns:us-east-1:320333787981:city_alerts:9f2dad1d-844')
Deleting multiple subscriptions
response = sns.list_subscriptions_by_topic(
TopicArn='arn:aws:sns:us-east-1:320333787981:city_alerts')
subs = response['Subscriptions']
for sub in subs:
    if sub['Protocol'] == 'sms':
        sns.unsubscribe(SubscriptionArn=sub['SubscriptionArn'])
Sending Messages
Publishing to a topic:
response = sns.publish(
TopicArn='arn:aws:sns:us-east-1:320333787981:city_alerts',
Message='Body of SMS or e-mail',
Subject='Subject Line for Email'
)
Sending custom messages:
num_of_reports = 137
response = sns.publish(
TopicArn='arn:aws:sns:us-east-1:320333787981:city_alerts',
Message='There are {} reports outstanding'.format(num_of_reports),
Subject='Subject Line for Email'
)
Sending a single SMS without Topic or Subscriber:
response = sns.publish(
PhoneNumber='+13121233211',
Message='Body text of SMS or e-mail'
)
- Good for one-off messages, but not a good long-term practice
Case Study: Building a notification system
1) Topic Set Up
- Create topics for each service
sns = boto3.client('sns',
region_name='us-east-1',
aws_access_key_id=AWS_KEY_ID,
aws_secret_access_key=AWS_SECRET)
trash_arn = sns.create_topic(Name='trash_notifications')['TopicArn']
streets_arn = sns.create_topic(Name='streets_notifications')['TopicArn']
- Download the contact list CSV
contacts = pd.read_csv('http://gid-staging.s3.amazonaws.com/contacts.csv')
- Subscribe the contacts to their respective topics
def subscribe_user(user_row):
    if user_row['Department'] == 'trash':
        sns.subscribe(TopicArn=trash_arn, Protocol='sms', Endpoint=str(user_row['Phone']))
        sns.subscribe(TopicArn=trash_arn, Protocol='email', Endpoint=str(user_row['Email']))
    else:
        sns.subscribe(TopicArn=streets_arn, Protocol='sms', Endpoint=str(user_row['Phone']))
        sns.subscribe(TopicArn=streets_arn, Protocol='email', Endpoint=str(user_row['Email']))
contacts.apply(subscribe_user, axis=1)
2) Get aggregated numbers
- Download monthly report
df = pd.read_csv('http://gid-reports.s3.amazonaws.com/2019/feb/final_report.csv')
- Get counts of pothole and illegal dumping reports
df.set_index('service_name', inplace=True)
trash_violations_count = df.at['Illegal Dumping', 'count']
streets_violations_count = df.at['Pothole', 'count']
3) Send alerts
- If illegal dumping (trash) reports exceed 100, send an alert to the trash topic
- If pothole (streets) reports exceed 30, send an alert to the streets topic
if trash_violations_count > 100:
    message = "Trash violations count is now {}".format(trash_violations_count)
    sns.publish(TopicArn=trash_arn,
                Message=message,
                Subject="Trash Alert")
if streets_violations_count > 30:
    message = "Streets violations count is now {}".format(streets_violations_count)
    sns.publish(TopicArn=streets_arn,
                Message=message,
                Subject="Streets Alert")
Computer Vision: AWS Rekognition
Boto3 follows the same pattern for all AWS services.
Rekognition is a computer vision API from AWS. Uses include detecting objects in an image and extracting text from images.
Upload an image to S3:
# Initialise S3 Client
s3 = boto3.client(
's3', region_name='us-east-1',
aws_access_key_id=AWS_KEY_ID,
aws_secret_access_key=AWS_SECRET
)
# Upload file
s3.upload_file(
Filename='report.jpg',
Key='report.jpg',
Bucket='datacamp-img')
Object detection:
# Construct Rekognition client
rekog = boto3.client(
    'rekognition',
    region_name='us-east-1',
    aws_access_key_id=AWS_KEY_ID,
    aws_secret_access_key=AWS_SECRET)
# Call detect_labels method
response = rekog.detect_labels(
Image={'S3Object': {
'Bucket': 'datacamp-img',
'Name': 'report.jpg'}
},
MaxLabels=10,
MinConfidence=95
)
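The response contains a list of detected labels with confidence scores; a small sketch of inspecting them:
# Print each detected label and its confidence score
for label in response['Labels']:
    print(label['Name'], label['Confidence'])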
Text Detection:
response = rekog.detect_text(
Image={'S3Object': {
'Bucket': 'datacamp-img',
'Name': 'report.jpg'}
}
)
Returns 'LINE' detections (rows of text) and 'WORD' detections (individual words).
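For example, to collect just the detected lines of text from the response:
# Keep only the 'LINE' detections (full rows of text)
lines = [d['DetectedText'] for d in response['TextDetections']
         if d['Type'] == 'LINE']
print(lines)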
NLP: AWS Translate, AWS Comprehend
Translating text:
# Initialise client
translate = boto3.client('translate',
region_name='us-east-1',
aws_access_key_id=AWS_KEY_ID,
aws_secret_access_key=AWS_SECRET)
# Translate Text
response = translate.translate_text(
Text='Hello, how are you?',
SourceLanguageCode='auto',
TargetLanguageCode='es')
translated_text = response['TranslatedText']
Detecting Language:
# Initialise client
comprehend = boto3.client('comprehend',
region_name='us-east-1',
aws_access_key_id=AWS_KEY_ID,
aws_secret_access_key=AWS_SECRET)
# Detect dominant language
response = comprehend.detect_dominant_language(
Text="Hay basura por todas partes a lo largo de la carretera."
)
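The response lists candidate languages with confidence scores; a small sketch of picking the most likely one:
# Pick the language with the highest confidence score
dominant = max(response['Languages'], key=lambda lang: lang['Score'])
print(dominant['LanguageCode'])  # e.g. 'es'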
Understanding Sentiment:
# Detect text sentiment
response = comprehend.detect_sentiment(
Text="DataCamp students are amazing.",
LanguageCode='en')
sentiment = response['Sentiment']
Case Study: Detecting sentiment about e-scooters blocking the sidewalk
1) Initialise Boto3 clients
rekog = boto3.client('rekognition', region_name='us-east-1',
    aws_access_key_id=AWS_KEY_ID, aws_secret_access_key=AWS_SECRET)
comprehend = boto3.client('comprehend', region_name='us-east-1',
    aws_access_key_id=AWS_KEY_ID, aws_secret_access_key=AWS_SECRET)
translate = boto3.client('translate', region_name='us-east-1',
    aws_access_key_id=AWS_KEY_ID, aws_secret_access_key=AWS_SECRET)
2) Translate all descriptions into English
for index, row in df.iterrows():
    desc = df.loc[index, 'public_description']
    if desc != '':
        resp = translate.translate_text(
            Text=desc,
            SourceLanguageCode='auto',
            TargetLanguageCode='en')
        df.loc[index, 'public_description'] = resp['TranslatedText']
3) Detect text sentiment
for index, row in df.iterrows():
    desc = df.loc[index, 'public_description']
    if desc != '':
        resp = comprehend.detect_sentiment(
            Text=desc,
            LanguageCode='en')
        df.loc[index, 'sentiment'] = resp['Sentiment']
4) Detect a scooter in each image
df['img_scooter'] = 0
for index, row in df.iterrows():
    image = df.loc[index, 'image']
    response = rekog.detect_labels(
        Image={'S3Object': {'Bucket': 'gid-images', 'Name': image}})
    for label in response['Labels']:
        if label['Name'] == 'Scooter':
            df.loc[index, 'img_scooter'] = 1
            break
5) Select only rows where there was a scooter image and negative sentiment
pickups = df[(df.img_scooter == 1) & (df.sentiment == 'NEGATIVE')]
num_pickups = len(pickups)