
Emile's Notes

Data Science/Programming notes

Introduction to AWS Boto in Python

Intro to AWS and Boto3

AWS provides storage, compute and alerting services that we can leverage in data projects. AWS services are granular, meaning they can work together or on their own.

To interact with AWS services using Python we can use the Boto3 library:

import boto3

s3 = boto3.client(
        's3',
        region_name='us-east-1',
        aws_access_key_id=AWS_KEY_ID,
        aws_secret_access_key=AWS_SECRET)

response = s3.list_buckets()

Creating an IAM User

IAM - Identity Access Management

To create an IAM user, we can search for IAM in the AWS Management Console. Selecting Users followed by Add User lets us enter a user name and choose Programmatic access, so that we can connect via Boto3 using an access key and secret.

We can select ‘Attach existing policies directly’ and search for the policy we wish to attach, e.g. AmazonS3FullAccess.

We will then be provided with an Access key ID and a Secret access key that we can use to programmatically interact with AWS.
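
As a quick sanity check (not part of the notes), the new key pair can be verified with an STS call, assuming the credentials are stored in environment variables named AWS_KEY_ID and AWS_SECRET:

import os
import boto3

# Hypothetical variable names; use whatever secret storage you prefer
AWS_KEY_ID = os.environ['AWS_KEY_ID']
AWS_SECRET = os.environ['AWS_SECRET']

sts = boto3.client(
    'sts',
    region_name='us-east-1',
    aws_access_key_id=AWS_KEY_ID,
    aws_secret_access_key=AWS_SECRET)

# Returns the account ID and user ARN the keys belong to
print(sts.get_caller_identity())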

AWS Services

S3 (Simple Storage Service) - Lets us store files in the cloud.

SNS (Simple Notification Service) - Lets us send emails and texts to alert subscribers based on events and conditions in our data pipelines.

Comprehend - Performs sentiment analysis on blocks of text.

Rekognition - Detects objects in images and extracts text from them.
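
All four services are reached through the same Boto3 client pattern; a minimal sketch (using a shared boto3.Session, which the notes do not cover) might look like:

import boto3

# One session holds the credentials; each service gets its own client
session = boto3.Session(
    region_name='us-east-1',
    aws_access_key_id=AWS_KEY_ID,
    aws_secret_access_key=AWS_SECRET)

s3 = session.client('s3')
sns = session.client('sns')
comprehend = session.client('comprehend')
rekognition = session.client('rekognition')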


Diving into buckets

S3 allows us to place files in the cloud, accessible from anywhere via a URL.

Buckets - Analogous to Desktop folders

Objects - Analogous to files

What can we do with buckets using Boto3?

import boto3

# Create boto3 client
s3 = boto3.client(
            's3',
            region_name='us-east-1',
            aws_access_key_id=AWS_KEY_ID,
            aws_secret_access_key=AWS_SECRET)

# Create bucket
bucket = s3.create_bucket(Bucket='gid-requests')

# List buckets as dict
bucket_response = s3.list_buckets()

# Delete bucket
response = s3.delete_bucket(Bucket='gid-requests')

Uploading and Retrieving Files

An object can be anything: an image, a video file, a CSV, a log file, etc.

Objects and buckets are analogous to files and folders on a desktop.

| Bucket | Object |
| --- | --- |
| A bucket has a name | An object has a key |
| The name is a string | The key is the full path from the bucket root |
| The name is unique across all of S3 | The key is unique within its bucket |
| Contains many objects | Can only be in one parent bucket |
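
For example (illustrative names only), the bucket name is flat while the key can contain '/' separators that act like folder paths:

bucket = 'gid-requests'                   # globally unique across all of S3
key = '2019/gid_requests_2019_01_01.csv'  # full path from the bucket root
url = 'https://{}.s3.amazonaws.com/{}'.format(bucket, key)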


Upload files

s3.upload_file(
    Filename='gid_requests_2019_01_01.csv', # local file path
    Bucket='gid-requests', # name of bucket we're uploading to
    Key='gid_requests_2019_01_01.csv') # name in s3 

List objects in a bucket

response = s3.list_objects(
                Bucket='gid-requests',
                MaxKeys=2, # limit response to n objects
                Prefix='gid_requests_2019_') # limit to objects starting with prefix

Get object metadata

response = s3.head_object(
                Bucket='gid-requests',
                Key='gid_requests_2018_12_30.csv')
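
The response is a plain dictionary; a minimal sketch of reading a few common metadata fields from it (field names as returned by S3):

print(response['ContentLength'])  # object size in bytes
print(response['LastModified'])   # datetime of the last write
print(response['ContentType'])    # MIME type stored with the object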

Download file

s3.download_file(
    Filename='gid_requests_downed.csv', # local file path to download to
    Bucket='gid-requests',
    Key='gid_requests_2018_12_30.csv')

Delete objects

s3.delete_object(
    Bucket='gid-requests',
    Key='gid_requests_2018_12_30.csv')

Sharing Files Securely

AWS denies permissions by default, so we must explicitly grant access before objects or buckets can be read.

AWS Permissions Systems:

ACLs

ACLs are entities attached to objects in S3. Common values include 'public-read' and 'private'.

# By default, when unspecified ACL is 'private'
s3.upload_file(
  Filename='potholes.csv', Bucket='gid-requests', Key='potholes.csv')

# Set ACL to 'public-read'
s3.put_object_acl(
  Bucket='gid-requests', Key='potholes.csv', ACL='public-read')

Setting ACLs on upload:

s3.upload_file(
    Bucket='gid-requests',
    Filename='potholes.csv',
    Key='potholes.csv',
    ExtraArgs={'ACL': 'public-read'})

Accessing public objects

Publicly accessible s3 objects can be accessed using the url template:

https://{bucket}.s3.amazonaws.com/{key}

e.g. https://gid-requests.s3.amazonaws.com/2019/potholes.csv

Generating public objects URL:

url = "https://{}.s3.amazonaws.com/{}".format(
  "gid-requests",
  "2019/potholes.csv")

df = pd.read_csv(url)

Downloading a private file

# Download file
s3.download_file(
  Filename='potholes_local.csv',
  Bucket='gid-staging',
  Key='2019/potholes_private.csv')

# Read from Disk
pd.read_csv('./potholes_local.csv')

In memory read of file:

obj = s3.get_object(Bucket='gid-requests', Key='2019/potholes.csv')

pd.read_csv(obj['Body']) # Read StreamingBody object into Pandas

Pre-signed URLs

Pre-signed URLs grant temporary access to S3 objects and expire after a set timeframe.

# Generate Presigned URL

share_url = s3.generate_presigned_url(
  ClientMethod='get_object',
  ExpiresIn=3600, # grant access for one hour
  Params={'Bucket': 'gid-requests', 'Key': 'potholes.csv'}
)

pd.read_csv(share_url)

Load multiple files into one DataFrame:

# Create list to hold our DataFrames
df_list = []


# Request the list of CSVs from S3 with prefix; get contents
response = s3.list_objects(
  Bucket='gid-requests', 
  Prefix='2019/')


# Get response contents
request_files = response['Contents']

# Iterate over each object
for file in request_files:
  obj = s3.get_object(Bucket='gid-requests', Key=file['Key'])

  #Read it as DataFrame
  obj_df = pd.read_csv(obj['Body'])

  # Append DataFrame to list
  df_list.append(obj_df)

# Concatenate all the DataFrames in the list
df = pd.concat(df_list)

Sharing files through a website

Converting a DataFrame to html:

df.to_html('table_agg.html',
           render_links=True,
           columns=['service_name', 'request_count', 'info_link'],
           border=0)

Upload an HTML file to S3:

s3.upload_file(
  Filename='./table_agg.html',
  Bucket='datacamp-website',
  Key='table.html',
  ExtraArgs = {
    'ContentType': 'text/html',
    'ACL': 'public-read'}
)

Accessing HTML file:

https://{bucket}.s3.amazonaws.com/{key}

https://datacamp-website.s3.amazonaws.com/table.html

Uploading an image file:

s3.upload_file(
  Filename='./plot_image.png',
  Bucket='datacamp-website',
  Key='plot_image.png',
  ExtraArgs = {
    'ContentType': 'image/png',
    'ACL': 'public-read'}
)

IANA Media Types:

A full list of IANA media types can be found at http://www.iana.org/assignments/media-types/media-types.xhtml.

Common types include: JSON, application/json; PNG, image/png; PDF, application/pdf; CSV, text/csv.
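
The media type goes in the ContentType entry of ExtraArgs when uploading; a small illustrative helper (the mapping and file names here are made up, not from the notes) keeps uploads consistent:

# Hypothetical mapping of file extensions to IANA media types
CONTENT_TYPES = {
    '.json': 'application/json',
    '.png': 'image/png',
    '.pdf': 'application/pdf',
    '.csv': 'text/csv',
    '.html': 'text/html'}

s3.upload_file(
    Filename='./monthly_report.pdf',
    Bucket='datacamp-website',
    Key='monthly_report.pdf',
    ExtraArgs={'ContentType': CONTENT_TYPES['.pdf'],
               'ACL': 'public-read'})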

Generating an index page:

# List the gid-reports bucket objects starting with 2019/
r = s3.list_objects(Bucket='gid-reports', Prefix='2019/')

# Convert the response contents to DataFrame
objects_df = pd.DataFrame(r['Contents'])

# Create column "Link" that contains website url + key
base_url = 'http://datacamp-website.s3.amazonaws.com/'
objects_df['Link'] = base_url + objects_df['Key']

# Write DataFrame to html
objects_df.to_html('report_listing.html',
                   columns=['Link', 'LastModified', 'Size'],
                   render_links=True)

# Upload HTML file to S3
s3.upload_file(
  Filename='./report_listing.html',
  Bucket='datacamp-website',
  Key='index.html',
  ExtraArgs = {
    'ContentType': 'text/html',
    'ACL': 'public-read'}
)

Case Study: Generating a Report Repository

1) Prepare the data

# Create list to hold our DataFrames
df_list = []

# Request the list of CSVs from S3 with prefix; Get contents
response = s3.list_objects(
  Bucket='gid-requests',
  Prefix='2019_jan')

# Get response contents
request_files = response['Contents']
# Iterate over each object
for file in request_files:
   obj = s3.get_object(Bucket='gid-requests', Key=file['Key'])

   # Read it as DataFrame
   obj_df = pd.read_csv(obj['Body'])

   # Append DataFrame to list
   df_list.append(obj_df)

# Concatenate all dfs in list
df = pd.concat(df_list)
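
The notes jump straight to agg_df in the next step; one plausible aggregation (assuming the concatenated data has a service_name column, as the final report read later in the notification case study suggests) might be:

# Hypothetical aggregation step producing the agg_df written out below
agg_df = df.groupby('service_name').size().to_frame('count')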

2) Create the Report

# Write agg_df to a CSV and HTML file with no border
agg_df.to_csv('./jan_final_report.csv')
agg_df.to_html('./jan_final_report.html', border=0)

3) Upload report to shareable website

# Upload Aggregated CSV to S3
s3.upload_file(Filename='./jan_final_report.csv',
               Key='2019/jan/final_report.csv',
               Bucket='gid-reports',
               ExtraArgs = {'ACL': 'public-read'})

# Upload HTML table to S3
s3.upload_file(Filename='./jan_final_report.html',
               Key='2019/jan/final_report.html',
               Bucket='gid-reports',
               ExtraArgs = {
                 'ContentType': 'text/html',
                 'ACL': 'public-read'})

# Upload Aggregated Chart to S3
s3.upload_file(Filename='./jan_final_chart.html',
               Key='2019/jan/final_chart.html',
               Bucket='gid-reports',
               ExtraArgs = {
                 'ContentType': 'text/html',
                 'ACL': 'public-read'})

# List the gid-reports bucket objects starting with 2019/
r = s3.list_objects(Bucket='gid-reports', Prefix='2019/')

# Convert the response contents to DataFrame
objects_df = pd.DataFrame(r['Contents'])

# Create a column "Link" that contains website url + key
base_url = "https://gid-reports.s3.amazonaws.com/"
objects_df['Link'] = base_url + objects_df['Key']

# Write DataFrame to html
objects_df.to_html('report_listing.html',
                   columns=['Link', 'LastModified', 'Size'],
                   render_links=True)

# Upload the file to gid-reports bucket root
s3.upload_file(Filename='./report_listing.html',
               Key='index.html',
               Bucket='gid-reports',
               ExtraArgs = {
                 'ContentType': 'text/html',
                 'ACL': 'public-read'})

http://gid-reports.s3.amazonaws.com/index.html

SNS Topics

SNS - Simple Notification Service

Publisher -> SNS Topic -> SNS Subscriber (via email or text)

Each SNS topic has a unique ARN (Amazon Resource Name), and each subscription has a unique ID.

Creating an SNS Topic

sns = boto3.client('sns',
                   region_name='us-east-1',
                   aws_access_key_id=AWS_KEY_ID,
                   aws_secret_access_key=AWS_SECRET)

response = sns.create_topic(Name='city_alerts')

topic_arn = response['TopicArn']

Listing Topics

sns.list_topics()

Deleting Topics

sns.delete_topic(TopicArn='arn:aws:sns:us-east-1:320333787981:city_alerts')

# Get the current list of topics
topics = sns.list_topics()['Topics']

for topic in topics:
  # For each topic, if it is not marked critical, delete it
  if "critical" not in topic['TopicArn']:
    sns.delete_topic(TopicArn=topic['TopicArn'])
    
# Print the list of remaining critical topics
print(sns.list_topics()['Topics'])

SNS Subscriptions

Each subscription has a unique ID, an Endpoint (the phone number or email address messages are sent to), a Status ('confirmed' or 'pending confirmation') and a Protocol (email or SMS).

sns = boto3.client('sns',
                   region_name='us-east-1',
                   aws_access_key_id=AWS_KEY_ID,
                   aws_secret_access_key=AWS_SECRET)

response = sns.subscribe(
  TopicArn = 'arn:aws:sns:us-east-1:320333787981:city_alerts',
  Protocol = 'SMS',
  Endpoint = '+13125551123')

Subscriptions are confirmed automatically for SMS, but remain 'pending confirmation' for email until the user confirms.
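
A still-pending email subscription shows up in the listings below with the placeholder string 'PendingConfirmation' instead of a real SubscriptionArn (worth double-checking against the current SNS docs); a minimal sketch:

subs = sns.list_subscriptions_by_topic(
  TopicArn='arn:aws:sns:us-east-1:320333787981:city_alerts')['Subscriptions']

for sub in subs:
  if sub['SubscriptionArn'] == 'PendingConfirmation':
    print('{} has not confirmed yet'.format(sub['Endpoint']))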

Listing Subscriptions

# List subscriptions for one topic
sns.list_subscriptions_by_topic(
  TopicArn='arn:aws:sns:us-east-1:320333787981:city_alerts')

# List all subscriptions
sns.list_subscriptions()['Subscriptions']

Deleting Subscriptions

sns.unsubscribe(
    SubscriptionArn='arn:aws:sns:us-east-1:320333787981:city_alerts:9f2dad1d-844')

Deleting multiple subscriptions

response = sns.list_subscriptions_by_topic(
  TopicArn='arn:aws:sns:us-east-1:320333787981:city_alerts')

subs = response['Subscriptions']

for sub in subs:
  if sub['Protocol'] == 'sms':
    sns.unsubscribe(SubscriptionArn=sub['SubscriptionArn'])

Sending Messages

Publishing to a topic:

response = sns.publish(
  TopicArn='arn:aws:sns:us-east-1:320333787981:city_alerts',
  Message='Body of SMS or e-mail',
  Subject='Subject Line for Email'
)

Sending custom messages:

num_of_reports = 137

response = sns.publish(
  TopicArn='arn:aws:sns:us-east-1:320333787981:city_alerts',
  Message='There are {} reports outstanding'.format(num_of_reports),
  Subject='Subject Line for Email'
)

Sending a single SMS without a topic or subscription:

response = sns.publish(
  PhoneNumber='+13121233211',
  Message='Body text of SMS or e-mail'
)

Case Study: Building a notification system

1) Topic Set Up

sns = boto3.client('sns',
                   region_name='us-east-1',
                   aws_access_key_id=AWS_KEY_ID,
                   aws_secret_access_key=AWS_SECRET)

trash_arn = sns.create_topic(Name='trash_notifications')['TopicArn']
streets_arn = sns.create_topic(Name='streets_notifications')['TopicArn']
contacts = pd.read_csv('http://gid-staging.s3.amazonaws.com/contacts.csv')

def subscribe_user(user_row):
   if user_row['Department'] == 'trash':
      sns.subscribe(TopicArn=trash_arn, Protocol='sms', Endpoint=str(user_row['Phone']))
      sns.subscribe(TopicArn=trash_arn, Protocol='email', Endpoint=str(user_row['Email']))
   else:
      sns.subscribe(TopicArn=streets_arn, Protocol='sms', Endpoint=str(user_row['Phone']))
      sns.subscribe(TopicArn=streets_arn, Protocol='email', Endpoint=str(user_row['Email']))

contacts.apply(subscribe_user, axis=1)

2) Get aggregated numbers

df = pd.read_csv('http://gid-reports.s3.amazonaws.com/2019/feb/final_report.csv')
df.set_index('service_name', inplace=True)

trash_violations_count = df.at['Illegal Dumping', 'count']
streets_violations_count = df.at['Pothole', 'count']

3) Send alerts

if trash_violations_count > 100:

   message = "Trash violations count is now {}".format(trash_violations_count)

   sns.publish(TopicArn=trash_arn,
               Message=message,
               Subject="Trash Alert")

if streets_violations_count > 30:

   message = "Streets violations count is now {}".format(streets_violations_count)

   sns.publish(TopicArn=streets_arn,
               Message=message,
               Subject="Streets Alert")

Computer Vision: AWS Rekognition

Boto3 follows the same pattern for all AWS services.

Rekognition is a computer vision API by AWS. Uses include: detecting objects in an image and extracting text from images.

Upload an image to S3:

# Initialise S3 Client
s3 = boto3.client(
  's3', region_name='us-east-1',
  aws_access_key_id=AWS_KEY_ID,
  aws_secret_access_key=AWS_SECRET
)

# Upload file
s3.upload_file(
  Filename='report.jpg',
  Key='report.jpg',
  Bucket='datacamp-img')

Object detection:

# Construct Rekognition client
rekog = boto3.client(
  'rekognition',
  region_name='us-east-1',
  aws_access_key_id=AWS_KEY_ID,
  aws_secret_access_key=AWS_SECRET)

# Call detect_labels method
response = rekog.detect_labels(
   Image={'S3Object': {
             'Bucket': 'datacamp-img',
             'Name': 'report.jpg'}
         },
   MaxLabels=10,
   MinConfidence=95
)
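
The response contains a Labels list; a minimal sketch of reading out each detected label and its confidence (field names as returned by Rekognition):

for label in response['Labels']:
   print(label['Name'], label['Confidence'])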

Text Detection:

response = rekog.detect_text(
   Image={'S3Object': {
              'Bucket': 'datacamp-img',
              'Name': 'report.jpg'}
          }
)

Returns “line” (rows of text) and “word” detections (individual words).
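
A minimal sketch of keeping only the line-level detections (field names as returned by Rekognition):

lines = [d['DetectedText']
         for d in response['TextDetections']
         if d['Type'] == 'LINE']
print(lines)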

NLP: AWS Translate, AWS Comprehend

Translating text:

# Initialise client
translate = boto3.client('translate',
                         region_name='us-east-1',
                         aws_access_key_id=AWS_KEY_ID,
                         aws_secret_access_key=AWS_SECRET)

# Translate Text
response = translate.translate_text(
              Text='Hello, how are you?',
              SourceLanguageCode='auto',
              TargetLanguageCode='es')

translated_text = response['TranslatedText']

Detecting Language:

# Initialise client
comprehend = boto3.client('comprehend',
                          region_name='us-east-1',
                          aws_access_key_id=AWS_KEY_ID,
                          aws_secret_access_key=AWS_SECRET)

# Detect dominant language
response = comprehend.detect_dominant_language(
  Text="Hay basura por todas partes a lo largo de la carretera."
)
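
The response holds a ranked Languages list; a minimal sketch of reading the top result (field names as returned by Comprehend):

dominant = response['Languages'][0]
print(dominant['LanguageCode'], dominant['Score'])  # e.g. 'es' with a confidence score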

Understanding Sentiment:

# Detect text sentiment
response = comprehend.detect_sentiment(
   Text="DataCamp students are amazing.",
   LanguageCode='en')

sentiment = response['Sentiment']
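
Besides the overall label, the response also includes per-class confidences under SentimentScore (keys as returned by Comprehend):

scores = response['SentimentScore']
print(scores['Positive'], scores['Negative'], scores['Neutral'], scores['Mixed'])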

Case Study: Detecting sentiment about e-scooters blocking the sidewalk

Initialise Boto3 clients:

rekog = boto3.client('rekognition',
                     region_name='us-east-1',
                     aws_access_key_id=AWS_KEY_ID,
                     aws_secret_access_key=AWS_SECRET)

comprehend = boto3.client('comprehend',
                          region_name='us-east-1',
                          aws_access_key_id=AWS_KEY_ID,
                          aws_secret_access_key=AWS_SECRET)

translate = boto3.client('translate',
                         region_name='us-east-1',
                         aws_access_key_id=AWS_KEY_ID,
                         aws_secret_access_key=AWS_SECRET)

Translate all descriptions into English:

for index, row in df.iterrows():
    desc = df.loc[index, 'public_description']
    if desc != '':
        resp = translate.translate_text(Text=desc,
                                        SourceLanguageCode='auto',
                                        TargetLanguageCode='en')
        df.loc[index, 'public_description'] = resp['TranslatedText']

Detect text sentiment:

for index, row in df.iterrows():
    desc = df.loc[index, 'public_description']
    if desc != '':
        resp = comprehend.detect_sentiment(Text=desc, LanguageCode='en')
        df.loc[index, 'sentiment'] = resp['Sentiment']

Detect scooters in images:

df['img_scooter'] = 0
for index, row in df.iterrows():
    image = df.loc[index, 'image']
    response = rekog.detect_labels(
        Image={'S3Object': {'Bucket': 'gid-images', 'Name': image}})
    for label in response['Labels']:
        if label['Name'] == 'Scooter':
            df.loc[index, 'img_scooter'] = 1
            break

Select only rows with a scooter image and negative sentiment:

pickups = df[(df.img_scooter == 1) & (df.sentiment == 'NEGATIVE')]
num_pickups = len(pickups)