full interview

Systems Engineer

Systems Engineer –  Operations Engineering

Who we are:

EC2 OpsEng is at the heart of the leading Infrastructure-as-a-Service cloud computing platform. We manage the thundering herd of servers and systems that comprise EC2, using a service-oriented approach to large-scale systems management.

You’ll be part of a world-class team in a fast-paced environment that has the entrepreneurial feel of a start-up. This is an opportunity to operate and engineer systems on a massive scale, and to gain top-notch experience in cloud computing.

Who you are:

You have a passion for massive platforms. You want to know what makes them tick, to take their pulse and measure, and to drive them. You understand operating systems, networking, software and services and speak the language of scripting for systems administration. You have a passion for large scale automation. You live and breathe operational risk management. You understand technical debt. You know success is measured by Customers.

You + EC2 OpsEng:

You’ll be surrounded by people who are wickedly smart, passionate about cloud computing, and believe that world class service is critical to customer success. You’ll become a master at EC2 platform diagnosis, response, measurement, and automation. You will design and build the operational scalability that sustains the platform’s insane growth. You will measure your success and it will be visible. You’ll be so engaged on so many interesting things, you’ll think you’re at a startup.

Basic qualifications:

  • Expertise administering Linux operating systems. You have to be really good at this.
  • Experience scripting and programming for systems administration using one or more of the following: perl, python, ruby, shell
  • Experience with metrics acquisition and analysis, including log processing and regular expressions programming
  • Knowledge of TCP/IP networking fundamentals

Preferred qualifications:

  • Bachelor’s degree in computer science, computer engineering or related technical discipline
  • Experience developing web-based system monitoring and mitigation tools
  • SQL programming experience
  • Excel analytics and presentation experience
  • Network administration experience
  • Start-up experience
  • Experience with systems administration in Xen or another virtualized environment


Software Development Engineer

Software Development Engineer – Operations Engineering

Who we are:

EC2 OpsEng is at the heart of the leading Infrastructure-as-a-Service cloud computing platform. We manage the thundering herd of servers and systems that comprise EC2, using a service-oriented approach to large-scale systems management.

You’ll be part of a world-class team in a fast-paced environment that has the entrepreneurial feel of a start-up. This is an opportunity to develop management systems on a massive scale, and to gain top-notch experience in cloud computing.

Who you are:

You have a passion for massive platforms. You want to know what makes them tick, to take their pulse and measure, and to drive them. You understand operating systems, networking, software and services and speak the language of software development for large-scale services. You have a passion for problems of scale. You consider operational risk management in your software designs. You understand technical debt and how to manage complexity. You know success is measured by Customers.

You + EC2 OpsEng:

You’ll be surrounded by wickedly smart software developers and systems engineers who are passionate about cloud computing, and believe that world class software is critical to customer success. You’ll become a master at designing and developing services to manage the EC2 platform. You’ll be engaged on so many interesting things that you’ll think you’re at a start-up. Project areas include:

  • Workflow services for automated issue remediation
  • Detection and analysis services for identifying issues
  • Data vending for serving operational information to a broad audience of consumers

Basic qualifications:

  • Experience developing service-oriented architectures in one or more of the following: java, ruby, perl
  • Experience developing web-based system monitoring and mitigation tools
  • Experience administering Linux operating systems, and scripting system administration tools
  • Knowledge of TCP/IP networking fundamentals

Preferred qualifications:

  • Bachelor’s degree in computer science, computer engineering or related technical discipline
  • Experience working with web-based applications at massive scale
  • SQL programming
  • Network administration
  • Start-up experience


Resume Review Checklist

If resume does not exhibit all of these, then pass
  • scripting for data analysis or systems administration (preferably perl, php, python, ruby, powershell, VB, shell if of reasonable complexity)
  • problem solving on platform of significant scale/complexity (not help desk nor I/T)
  • Expert Linux systems admin (not help desk nor I/T), including one or more of:
    • OS build and installation, packaging and software installation at scale
    • networking configuration and diagnostics, load balancing, security, tunneling, linux-based
    • infrastructure configuration and management (DNS, Active Directory, LDAP, account management)
    • virtualization preferably Xen
    • monitoring system configuration and extension, scripting (cacti, nagios, ganglia, munin, etc.)
If resume shows any of these, then that is a plus
  • production use of AWS
  • metrics awareness, metrics development
  • service oriented architecture
  • supporting 500+ servers
  • entrepreneurial experience
  • BS or advanced degree in CS
  • performance tuning
  • web programming, building web-db applications to support operations
  • service provider (ISP, NSP) experience
  • Worked at: google, supercomputing center, microsoft azure
  • Datamart, Extract-Transform-Load scripting (ETL), OLAP, Excel analytics
  • iptables
  • significant home computing infrastructure or other interesting project/academic work
  • lots of certifications
  • compliance and auditing (PCI, SOX, SAS70)
If resume shows any of these, consider it a red flag
  • Many short-duration assignments; have recruiter ask about this if resume is otherwise interesting
  • I/T and help desk
  • account management, home directory management, mail management
  • backups
  • small cog in big machine; working on the same thing for long time in large organization
  • purchasing and vendor focus, inventory management
  • rack and stack
  • exclusively enterprise experience
  • Can’t tell what they personally did on the projects described, heavy description of project; lots of “weasel words” (“helped with”, “involved with”)
Windows-centric candidates
EC2 OpsEng is considering Windows-centric candidates with the following qualifications:
  • Solid Linux skills since pretty much all diagnosis and action takes place in Linux
  • Demonstrated Windows experience at a systems level. Should be able to diagnose network, storage system, OS issues using perfmon at minimum.
  • Experience with one or more of the following:
    • Active Directory
    • Exchange

Phone Screening

Select questions from the categories below, covering all areas in all phone screens. Make sure your phone screen feedback contains enough detail to allow subsequent screeners to easily determine:

  • questions you’ve already asked
  • suggestions as to particular areas to probe and specific questions to ask
  • subsequent screens should validate previous screeners’ observations while diving deeper into the candidate’s aptitude and attitude.

If candidate “gets it” during first phone screen, then first screener provides sample code problem and second screener will receive the result.

The Basics (easy questions to establish interview viability)

Use these early on in the PS to figure out if the conversation is worth continuing. If candidate can’t get through some of these, big red flag.

Bits and Bytes
  • How many distinct values can a byte store? 256 (0 -> 255)
  • How many positive values can a signed byte store? 127 (the full range is -128 -> 127)
  • Describe RAID 0, 1, 1+0, 0+1, 5 and advantages, disadvantages
How do you make sure the website is running?

This is a really general question to see what kind of operations experience the candidate has and how they can convey a web services operational design.

  • Describe a web service architecture with CDN and walk through a customer transaction at a high level.
  • Multiple areas on monitoring: End-to-end (hopefully external network siteping), host hardware metrics, logscan, fleet-wide checking
  • Might talk about automatic recovery for certain alarms
  • Dashboards, graphs

Linux Runlevels

dwyerm@ feels this knowledge isn’t key for a good SE.

    • 0 = Halt.
    • 1 = Single user.
    • 2 = User defined.
    • 3 = Multi user network.
    • 4 = User Defined.
    • 5 = Multi user X session.
    • 6 = Reboot.

Linux File systems

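    • What is the difference between EXT2 and EXT3?
      • EXT3 is a journaling file system; EXT2 is not.
    • Explain how a journaling filesystem works.
      • The file system writes a record of each pending change to a journal before applying it to the main on-disk structures. After a crash, the journal is replayed to complete or discard interrupted transactions, so a full fsck is usually unnecessary.
    • What is an inode?
      • An on-disk data structure that holds a file’s attributes (size, permissions, owner, timestamps, etc.) and the pointers to its data blocks. It does not hold the file name.
    • What is the difference between a symbolic link and a hard link?
      • A symbolic link creates a new inode that contains the path to the original file/directory. A hard link is an additional directory entry that points to the same inode as the original file.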
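
The difference between the two link types can be observed from a short script. This is a minimal sketch, assuming a Linux host and a filesystem that supports both link types; the file names and the `link_demo` helper are made up for illustration.

```python
import os
import tempfile


def link_demo():
    """Create a file, a hard link and a symlink, then compare their inodes."""
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "original.txt")   # hypothetical file names
        hard = os.path.join(d, "hard.txt")
        soft = os.path.join(d, "soft.txt")
        with open(target, "w") as f:
            f.write("hello\n")

        os.link(target, hard)       # second directory entry for the same inode
        os.symlink(target, soft)    # new inode whose content is a path

        info = {
            "target_inode": os.stat(target).st_ino,
            "hard_inode": os.stat(hard).st_ino,
            "soft_inode": os.lstat(soft).st_ino,   # lstat does not follow the link
            "nlink": os.stat(target).st_nlink,
        }

        os.remove(target)           # drop the original name
        info["hard_still_readable"] = os.path.exists(hard)
        info["soft_dangling"] = not os.path.exists(soft)  # exists() follows symlinks
        return info


print(link_demo())
```

A candidate who understands inodes should be able to predict that the hard link survives removal of the original name while the symbolic link is left dangling.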

What is your favourite Linux distro and why?

I would expect a reasonably passionate answer here if the candidate is a regular Linux user. The reasoning behind the answer is very important – something like: “I use Debian because the apt package management system is superior to other package management systems like yum….” The more detail you can get from a candidate on this introductory question, the better. Zealotry is allowed here, so long as it’s backed up by some valid reasoning.


fsck

    • why/when is it run?
      • usually on boot, particularly if you’ve crashed the server
      • runs on boot after a number of remounts set in the superblock
      • how do you set/change this number? mkfs, tune2fs
      • can be manually run
      • how do you skip it on bootup?
        • flag file e.g. .fastboot or .autofsck
        • what can break?
      • how do you fsck when your root partition is dead?
        • boot from non-local medium e.g. PXE or CD
      • how do you recover from a bad superblock?
        • use alternate superblock
        • how do you find that?
          • mkfs -n (or equivalent “no-write” flag)

problem: “can’t log into server”

    • network responds (ping)
    • ssh doesn’t allow you in
      • is ssh running? how can you tell (remotely): telnet to port 22/ssh -v
    • reverse DNS not working (need to check exactly how ssh responds!)
    • MTU issues (tcpdump will spot this, also manually tweaking mtu)
    • user not in passwd file
    • user’s shell crippled
      • radius server unreachable
    • recover from one of the above

You get a trouble ticket that states a server is experiencing slow IO. What tools do you use?

    • iostat
    • vmstat

My disk is filling up!

You get an alarm indicating that the disk on a single host is filling up too rapidly

Properties of a good answer
  • Knows common reasons for disks filling up (logs, core dumps)
  • Knows how to find the actual disk filling up (df)
  • Knows how to find the full directory / file
  • Thinks about whether it’s okay to destroy the file
  • Takes the host out of service before working on it
  • Checks for a system-wide failure
  • Knows about filehandles being held open (knows how to find the process with the file handle)
  • They know different servers hold filehandles differently for their log files

The disk reports that it is full, but there is plenty of free space. (Answer: inodes?)
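
Block usage and inode usage can be compared in one call. This is a minimal sketch, assuming a Unix host where `os.statvfs` is available; the `fs_usage` helper and its key names are invented for this example.

```python
import os


def fs_usage(path="/"):
    """Return percent of data blocks and inodes in use on the filesystem holding path."""
    st = os.statvfs(path)

    def used_pct(total, free):
        # Some filesystems report 0 total inodes; return None rather than divide by zero.
        return None if total == 0 else round(100.0 * (total - free) / total, 1)

    return {
        "blocks_used_pct": used_pct(st.f_blocks, st.f_bfree),
        "inodes_used_pct": used_pct(st.f_files, st.f_ffree),
    }


print(fs_usage("/"))
```

A filesystem that has run out of inodes shows a high `inodes_used_pct` while `blocks_used_pct` stays low, which is the same picture `df -i` and `df` give on the command line.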

Why is the database getting overloaded?

Your database server is going nuts with CPU alarms. Applications that use it are working, but are slow. How do you find out what’s wrong with it?

Good answer
  • Might look at netstat on the database port
  • Looks at open database conns vs available conns
  • Runs some database commands like “SHOW FULL PROCESSLIST;”
  • Might turn on query logging
  • Might look at timing data logging in the database clients

What does <linux tool X> do?

Tools to ask about (pick one or two)
  • netstat
  • iostat
  • ps
  • lsof
  • du
  • df

Explain built-in disk utilities and files and what they do

    • iostat – reports CPU statistics and I/O statistics for devices and partitions
    • scsi_info – SCSI Device descriptions
    • cat /proc/scsi/scsi – SCSI device lists
    • cat /proc/partitions – List of system partitions

What is vmstat?

    • Virtual memory statistics utility.


What is a for i loop?

    • An iteration loop: it repeats a block of code once for each value taken by the loop variable i (for example, each item in a list or each number in a range).

What are the different memory (watermark) zones in linux?

Rockstar question – high free, low free, etc


What happens to a host when a process has a memory leak?

How do you deal with it? How do you ID the leaky process?

What is the difference between a process and a thread?

Address space stuff

How much memory can a process use?

Lead-in to virtual memory

On a 32 bit machine with 8 GB of ram, how is it possible that 10 processes can use all of their 4GB at once?

What is virtual memory?

Reasons for it, etc

File permissions

    • How do they work, how do you change them, how do you elevate your permissions?
    • r/w/x, user/group/other
    • chmod, chown, sudo
  • Searching files
    • How do you search for a string in a file, in hundreds of files in a directory tree?
    • grep, zgrep, find, xargs, sort, uniq, cut
  • Redirecting output
    • Print unique list of postal codes from a CSV file (name, address, city, state, zip)?
    • |, <, >
  • Disk usage
    • How do you check, identify large files, what if removing files doesn’t free disk space (open file handles)?
    • df, du, lsof, find, grep, awk, cut
  • Process management
    • How do you check CPU/IO usage, what is CPU load, identify bad processes, stop process?
    • ps, top, iostat, vmstat, kill

What is Sticky Bit?

Mode 1000.

<verbatim>#chmod o+t /home/share
drwxrwxrwt root root share</verbatim>

Setting the sticky bit prevents users from deleting each other’s files even though they have full access to the directory. Unix directory access permissions specify that a person with write access to the directory can rename or remove files there, even files that don’t belong to that person. Many newer versions of Unix have a way to stop that: the owner of a directory can set its sticky bit (mode 1000). From then on, the only people who can rename or remove any file in that directory are the file’s owner, the directory’s owner, and the superuser.
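
The same permission bits can be set and inspected programmatically. This is a minimal sketch, assuming a Linux host; it uses a throwaway temporary directory rather than /home/share, and the `make_shared_dir` helper is hypothetical.

```python
import os
import stat
import tempfile


def make_shared_dir():
    """Create a world-writable directory with the sticky bit set and return its mode."""
    d = tempfile.mkdtemp()
    os.chmod(d, 0o1777)          # rwxrwxrwx plus the sticky bit (chmod ignores umask)
    mode = os.stat(d).st_mode
    os.rmdir(d)                  # clean up the throwaway directory
    return mode


mode = make_shared_dir()
print(stat.filemode(mode))           # symbolic form, as ls -l would show it
print(bool(mode & stat.S_ISVTX))     # True when the sticky bit is set
```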

Discuss umask – what does it refer to and how does it work?

umask relates to a user’s default permissions for file creation. You can use the umask command to set the default mode for newly created files and directories. Its argument is a three-digit octal mode that represents the access to be inhibited (masked out) when a file is created. The kernel clears the masked bits from the mode the creating program requests, typically 666 for regular files and 777 for directories. For directories this works out to the octal complement of the mode you want: figure out the numeric mode you want and subtract it from 777. For example, to get the mode 751 by default, compute 777-751 = 026; this is the value you give to umask. <verbatim>#umask 026</verbatim> Once this command is executed, new directories are created with mode 751 and new regular files with mode 640 (666 with the 026 bits masked out). System administrators can put a umask command in the system initialization file to set a default for all users. You can set your own umask in your shell setup files to override defaults.
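
The masking arithmetic can be checked directly. This is a minimal sketch of the bitwise rule (requested mode AND NOT umask); `apply_umask` is a made-up helper, and the requested modes 777 and 666 are the usual defaults for directories and regular files rather than anything the kernel enforces.

```python
def apply_umask(requested: int, umask: int) -> int:
    """Return the mode a new file or directory ends up with."""
    return requested & ~umask


print(oct(apply_umask(0o777, 0o026)))   # new directory under umask 026
print(oct(apply_umask(0o666, 0o026)))   # new regular file under umask 026
print(oct(apply_umask(0o666, 0o033)))   # masking is bitwise, not subtraction
```

The last line is a useful follow-up question: 666 minus 033 would suggest 633, but clearing the masked bits gives 644, because bits that were never requested cannot be removed.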

What is an inode? What information does an inode contain?

Inode = index node. The inode object represents all the information needed by the kernel to manipulate a file or directory. The inode object is represented by struct inode and is defined in <linux/fs.h>. Here is (a sample of..) the structure, with comments describing each entry:

struct list_head        i_list;              /* list of inodes */
struct list_head        i_sb_list;           /* list of superblocks */
unsigned long           i_ino;               /* inode number */
unsigned int            i_nlink;             /* number of hard links */
uid_t                   i_uid;               /* user id of owner */
gid_t                   i_gid;               /* group id of owner */
loff_t                  i_size;              /* file size in bytes */
struct timespec         i_atime;             /* last access time */
struct timespec         i_mtime;             /* last modify time */
struct timespec         i_ctime;             /* last change time of inode */
unsigned int            i_blkbits;           /* block size in bits */
blkcnt_t                i_blocks;            /* file size in blocks */
umode_t                 i_mode;              /* access permissions */

Note that you can find when the permissions of a file last changed (even if the data in the file wasn’t changed) by referencing ctime. Note also that when the hard link count reaches zero and no process still has the file open, the inode and its associated data are deleted.

What does the “count” entry in an inode track?

The count value reflects how many times the file has been opened without being closed (in other words, how many references to the file are still active). This has some ramifications which aren’t obvious at first: you can delete a file so that no “filename” part points to the inode, without releasing the space for the data part of the file, because the file is still open.

*Related:* Have you ever found yourself in this position: you notice that /var/log/messages (or some other syslog-owned file) has grown too big, and you rm /var/log/messages && touch /var/log/messages to reclaim the space, but the used space doesn’t reappear? Why is that? (This is because, although you’ve deleted the filename part, there’s a process that still has the data part open (syslogd), and the OS won’t release the space for the data until the process closes it. In order to complete your space reclamation, you have to do something like: <verbatim>kill -SIGHUP `cat /var/run/syslogd.pid`</verbatim> to get syslogd to close and reopen the file.)
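
The "deleted but still open" situation is easy to reproduce. This is a minimal sketch, assuming a Linux host with a local filesystem; the `unlink_while_open` helper is hypothetical and stands in for a daemon such as syslogd holding its log file open.

```python
import os
import tempfile


def unlink_while_open():
    """Delete a file's name while a descriptor is still open on it."""
    fd, path = tempfile.mkstemp()
    os.write(fd, b"x" * 4096)
    os.unlink(path)                  # the directory entry is gone
    st = os.fstat(fd)                # the inode is still reachable through the fd
    result = {
        "name_exists": os.path.exists(path),
        "nlink": st.st_nlink,
        "size": st.st_size,
    }
    os.close(fd)                     # last reference dropped; now the space is freed
    return result


print(unlink_while_open())
```

Until the descriptor is closed, the data still occupies disk space even though no file name refers to it, which is why `lsof` is the tool to reach for in this situation.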

What does the sync command do on a Linux box?

sync – synchronizes data on disk with memory. sync writes any data buffered in memory out to disk. This can include (but is not limited to) modified superblocks, modified inodes, and delayed reads and writes. The sync program does nothing but exercise the sync system call. The kernel keeps data in memory to avoid doing (relatively slow) disk reads and writes. This improves performance, but if the computer crashes, data may be lost or the filesystem corrupted as a result. sync ensures that everything in memory is written to disk. sync should be called before the processor is halted in an unusual manner (e.g., before causing a kernel panic when debugging new kernel code). In general, the processor should be halted using the shutdown or reboot or halt commands, which will attempt to put the system in a quiescent state before calling sync.

What are the key open source tools in your toolkit?

Nagios (& Icinga): infrastructure monitoring is a field with many solutions, from Zabbix to Nagios to dozens of other open-source tools.
consul.io: Consul is a good fit for service discovery and configuration in modern, elastic applications that are built from microservices. It exposes a DNS interface, so services can be found through internal DNS names.

What is the coolest open source tool or application you’ve discovered in the last 6 months?

What do you know about virtualisation technologies?

Quite an open question to assess awareness. Follow up by asking which products they’ve used, for what kinds of projects etc.

What is a hypervisor and what is its function?

A hypervisor is a program running on a physical machine to manage the virtual machines running on that physical machine. Essentially the hypervisor is a thin, privileged abstraction layer between the hardware and operating systems. It defines the virtual machine that guest domains see instead of physical hardware, it grants portions of the full physical resources to each guest, it exports simplified devices to guests, it enforces isolation among guests.

If the candidate has mentioned Xen in their experience, follow up with the following questions:

What is your understanding of domU?

A DomU is the counterpart to Dom0; it is an unprivileged domain with (by default) no access to the hardware. It must run a FrontendDriver for multiplexed hardware it wishes to share with other domains. A DomU is started by *xend* in Dom0, which the user accesses with the xm command-line tool. The kernel for a DomU comes from Dom0’s filesystem, not from the filesystem exported to the DomU.



Can you tell me the names of the layers in the OSI networking model?

  • Physical
  • Data Link
  • Network
  • Transport
  • Session
  • Presentation
  • Application

Explain in as much detail as you can the TCP 3 way Handshake

In addition to SYN | SYN ACK | ACK sequence, I would expect the candidate to reference or be aware of SRC and DST ports, the function and operation of sequence numbers, MSS, mid-handshake connection states (LISTENING, SYN_SENT, SYN_RCVD, ESTABLISHED)

What is BGP?

How does Traceroute Work

  • Host A sends an IP Datagram with TTL=1 to the destination host.
  • The first router decrements the TTL, discards the datagram and sends an ICMP “time exceeded” (with its own IP as the source) – we now have the first hop or the first router in the path.
  • Host A sends an IP Datagram with TTL=2 to the destination host and we figure out the IP of the 2nd hop/router as above.
  • This continues until the datagram reaches the destination host (with a TTL of 1)
  • The destination host does not reply with an ICMP “time exceeded”. Because traceroute sends the datagram to an unlikely UDP port number (above 30,000), the destination host’s UDP module returns an ICMP “port unreachable” error instead, which tells traceroute it has reached the end of the path.
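
The probing logic above can be walked through with a toy simulation. This is only a sketch of the algorithm and sends no packets; the `send_probe` and `traceroute` functions and the documentation-range addresses are invented for illustration.

```python
def send_probe(path, ttl):
    """Simulate one UDP probe. path lists the routers in order, then the destination."""
    for hop, node in enumerate(path, start=1):
        if hop == len(path):
            # Destination reached: nothing listens on the high port.
            return node, "port unreachable"
        ttl -= 1                      # each router decrements the TTL
        if ttl == 0:
            return node, "time exceeded"   # router discards the datagram
    raise ValueError("empty path")


def traceroute(path, max_hops=30):
    """Raise the TTL one step at a time until the destination answers."""
    hops = []
    for ttl in range(1, max_hops + 1):
        responder, reply = send_probe(path, ttl)
        hops.append((ttl, responder, reply))
        if reply == "port unreachable":
            break
    return hops


for hop in traceroute(["10.0.0.1", "10.0.1.1", "192.0.2.10"]):
    print(hop)
```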

How can you tell if a site is under DDoS attack?

How do you mitigate it? Push candidate to work “out” the network (mod evasive, iptables, firewall, arbor in upstream provider).

What is the function of the TTL field?

The TTL value in a packet tells a router whether or not the packet has been in the network too long and should be discarded. In a nutshell, it helps prevent infinite routing loops.

What are the differences between TCP and UDP?

Properties of TCP
  • Delivery is in-order
  • Delivery is guaranteed (the sender knows if the packets weren’t delivered)
  • Throttling / flow control
  • An A+ answer will include a brief description of the sliding window and the ACK model
Properties of UDP
  • “Chuck-n-duck” delivery

Followup: Name some fields in the header in packets of each protocol

Followup 2: What is the TCP timestamp field used for?

How many hosts in a /23 network

/23 implies 23 consecutive 1’s in the netmask: 11111111.11111111.11111110.00000000 (255.255.254.0). A /23 is 2 x Class C networks. A Class C is denoted by the final octet in the mask above: (2^8) = 256 addresses. 2 x 256 = 512; minus the network address and broadcast address leaves 510.

A: 510 usable hosts in a /23. A simpler, more elegant alternative: 32-23=9. ((2^9)-2) = 510
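
The arithmetic can be cross-checked with Python’s standard `ipaddress` module. This is a minimal sketch; the 10.0.0.0/23 network is an arbitrary example and `usable_hosts` is a made-up helper that applies the 2^(32-n) - 2 shortcut, which only holds for prefixes up to /30.

```python
import ipaddress


def usable_hosts(prefix_len: int) -> int:
    """Host addresses in an IPv4 subnet, excluding network and broadcast."""
    return 2 ** (32 - prefix_len) - 2


net = ipaddress.ip_network("10.0.0.0/23")
print(net.netmask)                 # dotted-quad form of the /23 mask
print(net.num_addresses)           # all addresses, including network and broadcast
print(len(list(net.hosts())))      # usable host addresses
print(usable_hosts(23))            # same figure from the shortcut formula
```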

Given the ip address of with the subnet mask of how many addresses are available on the subnet and why?

Talk me through, in as much detail as you can, what happens when you connect to www.domain.com from your browser

  • Resolve IP for www.domain.com
  • Sends an ARP query to resolve the ethernet address for the configured name server
    • ARP broadcast query from your host to resolve the ethernet address for the name server.
    • Name Server responds with its ethernet address.
  • Host then sends a DNS query to the name server to resolve the hostname
    • port 53 is the “well-known” port for DNS
    • The request identification field is set to 7, the query is of type A (a query for host address), and the query name is “www.domain.com”
    • The name server sends the response to the host address query. In a packet capture such as tcpdump output, 7* indicates that it is responding to the request for which the request identification field was 7 and that the response was authoritative.
    • The DNS query and response are sent over UDP, which is the recommended method for queries. UDP is a connectionless protocol so no handshaking is necessary to establish the connection.
  • Sends a HTTP request for the document to the server.
    • TCP connection establishment (3 way handshake)
    • Your host sends a TCP packet with the SYN flag set and sequence number 7861110 with 0 bytes in the data segment to port 80 on www.domain.com (port 80 is the “well-known” port for HTTP).
    • www.domain.com responds with a TCP packet with the SYN and ACK flags set and the sequence number 3595122238 with 0 bytes in the data segment. That TCP packet has the acknowledgement number set to 7861111 to acknowledge receipt of the packet previously sent by your host (7861111 is derived from the sequence number of that packet = 7861110 + 1).
    • Your host sends www.domain.com a TCP packet with the ACK flag set and acknowledgement number 3595122239 (3595122238 + 1), which completes the TCP handshake.
  • HTTP request
  • HTTP response

What are load-balancers, and how do they work?

How do you recover from a loadbalancer failure?

A major loadbalancer just failed, causing an ongoing major site outage. You have a different loadbalancer that is already configured and connected to the network. How do you get the site back up and running?

Section 5: Data Structures

Assess coding depth and program design. Emphasis on learning ability, conceptual understanding.

  • List common data structures. Arrays, queues, vectors, stacks, lists (linked, double-linked), binary trees, hashes. What is a heap and how is it different from a stack?
  • Define arrays and hash tables. How you traverse them to visit all elements. Typical implementation.
  • Array vs. Linked list vs. Vector? (Array is non-expandable, most efficient for managing lists of known size. Vector is synchronized, growable array; stores any object derived from object class. Linked List is items linked to adjacent items, managing queues or stacks.). ArrayList (Java). Worst-case insertion performance of each? How do you grow a vector? What happens when your hash table fills up? What’s the performance of rehashing? http://www.javaranch.com/newsletter/June2002/listinterface.html
  • Implementing a priority queue. Data structure with keys which supports inserting new items and removing item with largest (highest priority) key. Stack where first out is based on priority order. Priority, Empty (if queue is empty), Insert, removeMax. Heapsort (Java). – http://www.cs.wisc.edu/~cs367-1/NOTES/12.PRIORITY-Q.html
  • Print out nodes of a tree in level order (first level, then second level, then third level…)
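
For the level-order question, one answer an interviewer might look for is a breadth-first walk driven by a queue. This is a minimal sketch for a binary tree; the `Node` class and `level_order` function are illustrative names, not part of any required solution.

```python
from collections import deque


class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right


def level_order(root):
    """Return node values grouped by depth, top level first."""
    levels = []
    if root is None:
        return levels
    queue = deque([root])
    while queue:
        level = []
        for _ in range(len(queue)):      # drain exactly one level per pass
            node = queue.popleft()
            level.append(node.value)
            if node.left is not None:
                queue.append(node.left)
            if node.right is not None:
                queue.append(node.right)
        levels.append(level)
    return levels


tree = Node(1, Node(2, Node(4), Node(5)), Node(3, None, Node(6)))
for level in level_order(tree):
    print(level)
```

Good follow-ups are the memory cost of the queue (it holds at most one full level) and why a stack would give depth-first order instead.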


  • Indexes
    1. What is an index?
    2. Which of these data structures would be appropriate for storing it and why: Array, Linked list, Hash table, B-Tree
    3. Given table (Employee) with columns (id, name, department, start_date) and index (id), will the following statements be slower, faster, or the same after adding an index on (start_date)?
      update Employee set name = 'John Doe' where id = 3                      /* same   */
      update Employee set start_date = '2009-01-01' where id = 3              /* slower */
      update Employee set name = 'John Doe' where start_date = '2009-01-01'   /* faster */


  • Employees
    • id
    • name
    • department_id
    • start_date
    • salary
  • Departments
    • id
    • name

Queries to ask for:

  • Get the names of all employees sorted by start date.
  • Get the name of the person that makes the most money.
  • Everyone’s name and the department they are in.
  • What department spends the most on salary?
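
One possible set of answers can be tried against an in-memory SQLite database. This is a sketch, not a model answer: the sample rows are invented, the SQL is written for SQLite (other databases may need a different way to limit rows), and `run_queries` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Departments (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Employees (
    id INTEGER PRIMARY KEY, name TEXT, department_id INTEGER,
    start_date TEXT, salary INTEGER);
INSERT INTO Departments VALUES (1, 'Ops'), (2, 'Dev');
INSERT INTO Employees VALUES
    (1, 'Ann', 1, '2008-03-01', 90),
    (2, 'Bob', 2, '2007-06-15', 120),
    (3, 'Cy',  2, '2009-01-10', 80);
""")


def run_queries(db):
    """Run the four interview queries and return their results."""
    by_start = [r[0] for r in db.execute(
        "SELECT name FROM Employees ORDER BY start_date")]
    top_paid = db.execute(
        "SELECT name FROM Employees ORDER BY salary DESC LIMIT 1").fetchone()[0]
    with_dept = db.execute(
        "SELECT e.name, d.name FROM Employees e "
        "JOIN Departments d ON d.id = e.department_id ORDER BY e.name").fetchall()
    biggest = db.execute(
        "SELECT d.name, SUM(e.salary) AS total FROM Employees e "
        "JOIN Departments d ON d.id = e.department_id "
        "GROUP BY d.id, d.name ORDER BY total DESC LIMIT 1").fetchone()
    return by_start, top_paid, with_dept, biggest


print(run_queries(conn))
```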


  • CustomerOrders
    • CustomerId
    • OrderId
    • OrderDate

Write SQL queries to find:

  • All of the CustomerId’s of all the customers who placed orders today
    • select CustomerId from CustomerOrders where trunc(OrderDate) = trunc(sysdate);
  • CustomerId’s of all customers who placed an order yesterday and placed an order today
    • select CustomerId 
      from CustomerOrders 
      where trunc(OrderDate) = trunc(sysdate) - 1 and 
      CustomerId in (
          select CustomerId from CustomerOrders where trunc(OrderDate) = trunc(sysdate)
      );
  • All of the CustomerId’s for customers who have placed 5 or more orders on our website in the last 30 days
    • select CustomerId, orderCount from (
          select CustomerId, count(OrderId) as orderCount from CustomerOrders where OrderDate > sysdate - 30 group by CustomerId
      ) where orderCount >= 5 order by orderCount desc;
  • In the past 30 days, the date with the most orders placed
    • select (select count(OrderId), Date from CustomerOrders where OrderDate > Sysdate - 30)...


  • Find and replace phone number in 10,000 web page files?
    • The user should come up with tools like grep and sed and give a rough definition for a regular expression to match the number.
    • If the candidate starts heading down the path of writing a full application, indicate they are logged in on the machine with the files and need to do this as quickly as possible. Note this.
  • Describe a really cool script you wrote. Outline why you believed the script was required, how you approached writing the script (which scripting language did you use? why?) and whether the desired result was achieved. If so how.
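
For the find-and-replace question, the expected answer is a grep/sed one-liner; the same idea is sketched here in Python to show what the regular expression has to cope with. The phone numbers, the `*.html` glob, and the `replace_in_tree` helper are all assumptions made up for this example.

```python
import re
from pathlib import Path

# Hypothetical old number, tolerating (206) 555-0100, 206.555.0100, 2065550100, ...
OLD_NUMBER = re.compile(r"\(?206\)?[-. ]?555[-. ]?0100")


def replace_in_tree(root, new_number, pattern=OLD_NUMBER, glob="*.html"):
    """Rewrite every matching file under root; return how many files changed."""
    changed = 0
    for path in Path(root).rglob(glob):
        text = path.read_text(encoding="utf-8")
        updated, hits = pattern.subn(new_number, text)
        if hits:                              # only touch files that contain a match
            path.write_text(updated, encoding="utf-8")
            changed += 1
    return changed
```

A strong candidate will also mention taking a backup or doing a dry run first, and checking for formatting variants the pattern does not cover.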

Perl programming

  • What are the 3 data types in Perl?

Perl has three built-in data types: scalars, arrays of scalars, and associative arrays of scalars, known as “hashes”. A scalar is a single string (of any size, limited only by the available memory), number, or a reference to something (which will be discussed in perlref). Normal arrays are ordered lists of scalars indexed by number, starting with 0. Hashes are unordered collections of scalar values indexed by their associated string key.

  • What is the difference between a hash and an array?
    • A hash is keyed.
  • What is Perl’s module for displaying all elements of a container?
    • Data::Dumper
  • How do you use Data::Dumper?
    • use Data::Dumper;
    • print Dumper($variablename);  # pass a reference, e.g. Dumper(\%hash), for arrays and hashes

Python Programming



Q: What are Python’s built-in data types? A: list ["v1", "v2"], dictionary {"a": 1, "b": 2}, set {"v1", "v2"}, tuple ("v1", "v2"), plus the basic types int, float, bool, str, bytes and None

Q: What is the difference between a list and a tuple? A: A list is a mutable type while tuple is not.

Q: What is the difference between a mutable and an immutable datatype; pros, cons? A: Somewhat open ended; want the candidate to recognize that the value of an immutable type can’t be changed. For most languages this means both a memory and performance hit when using immutable types (as “modifying” calls result in a new object being created). The benefit of using an immutable type is that one doesn’t have to worry about side-effects.

Q: Sequence slicing (return the first/last elements of a sequence) A:

  "foobarbaz"[:3] => "foo"
  "foobarbaz"[-3:] => "baz"
  "foobarbaz"[3:] => "barbaz"

Q: What is a lambda function? A: Lambda functions are akin to anonymous functions in other languages (but hobbled, imho).

Q: How do lambda functions differ from normal functions? A: lambda functions may consist of only a single expression.

Q: What are generators? A: A generator is a way to build an iterator that delivers data on an as needed basis. It holds enough state to be able to compute the next result. The advantage of generators is that they can be used to generate long sequences without holding the entire sequence in memory (or even having a sequence in memory).
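
A short example of the point about not holding the sequence in memory. This is a minimal sketch; the `squares` generator is an arbitrary illustration of an endless sequence that is only computed on demand.

```python
import itertools


def squares():
    """Yield 0, 1, 4, 9, ... forever; only the counter n is kept as state."""
    n = 0
    while True:
        yield n * n
        n += 1


# islice pulls just five values; the infinite sequence is never materialised.
first_five = list(itertools.islice(squares(), 5))
print(first_five)
```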

Q: Python has OO baked-in, along with inheritance. Does it support single or multiple inheritance? A: Multiple.

Q: If multiple, how does it resolve methods? A: Old-style classes use depth-first, left-to-right resolution (i.e. it depends on the order in which base classes are specified). New-style classes (the only kind in Python 3) use the C3 linearization to compute a method resolution order, and super() provides a dynamic mechanism akin to Lisp’s “call-next-method”.
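
A small sketch of method resolution with a diamond-shaped hierarchy, assuming Python 3 (where every class is new-style). The class names are hypothetical.

```python
class Base:
    def who(self):
        return ["Base"]

class Left(Base):
    def who(self):
        return ["Left"] + super().who()

class Right(Base):
    def who(self):
        return ["Right"] + super().who()

class Child(Left, Right):
    def who(self):
        return ["Child"] + super().who()

# The linearized method resolution order for Child.
order = [cls.__name__ for cls in Child.__mro__]

# super() follows that order, so Left hands off to Right, not straight to Base.
chain = Child().who()
```

A candidate who can explain why `Left` calls into `Right` here understands how `super()` differs from simply calling the parent class.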

Log slicing

  • Something requiring a “cut | sort | uniq -c”
  • Example is here
Count requests by ip

Hopefully you arrive at this question from a more general troubleshooting one, such as “A host is getting more traffic than others”.

Find the client IPs with the most requests to this host, given the following log format:

s200 2008-02-13T12:36:32-0800 1DH0AN6B01KADEMZ342W 21055 7659 HTTP/1.1 GET /gp/goldbox/display
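
For comparison with a “cut | sort | uniq -c” pipeline, here is a rough Python sketch of the same counting. The sample line above does not make clear which field holds the client IP, so the field index is left as a parameter. The sample data and the name `top_values` are hypothetical.

```python
from collections import Counter

def top_values(lines, field_index, limit=10):
    """Count whitespace-separated field `field_index` (0-based); return the most common values."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) > field_index:      # skip short or blank lines
            counts[fields[field_index]] += 1
    return counts.most_common(limit)

# Hypothetical sample: the client address is assumed to be field 0 here.
sample = [
    "10.0.0.1 GET /a",
    "10.0.0.2 GET /b",
    "10.0.0.1 GET /c",
]
top = top_values(sample, field_index=0)
```

Passing an open file object as `lines` keeps memory use proportional to the number of distinct values rather than to the size of the log.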

Rename a bunch of files from .jpg to .jpeg
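
One possible answer, sketched in Python. It assumes lowercase `.jpg` suffixes, a single directory with no recursion, and that existing `.jpeg` files must not be overwritten. The function names are illustrative.

```python
import os

def jpeg_name(name):
    """Return the .jpeg name for a .jpg file name, or None if it is not a .jpg."""
    root, ext = os.path.splitext(name)
    return root + ".jpeg" if ext == ".jpg" else None

def rename_jpg_to_jpeg(directory):
    """Rename every *.jpg directly inside `directory`; return the new names."""
    renamed = []
    for name in sorted(os.listdir(directory)):
        new_name = jpeg_name(name)
        if new_name is None:
            continue
        target = os.path.join(directory, new_name)
        if os.path.exists(target):
            continue                      # do not clobber an existing file
        os.rename(os.path.join(directory, name), target)
        renamed.append(new_name)
    return renamed
```

Good follow-ups: what happens with names containing spaces, uppercase `.JPG`, or a target name that already exists.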


Write a string reverse in Perl.

Good answer
  • Inefficient, but a good idiomatic perl answer one candidate gave:
$s = join ("",reverse (split //,$previousString));
  • Better:
$s = reverse($previousString);

Note that reverse() must be called in a scalar context for this to work as intended. Anyone claiming intermediate or better knowledge of Perl should know what “scalar context” means.

The Soft Stuff

  • Describe your work style (Structured / unstructured, what techniques you use for time management etc)
  • What do you feel you can contribute to EC2 Operations Engineering?
  • Who was your toughest customer and why?
  • How do you know if your customers are satisfied? look for demonstrable measurement
  • Outline a situation where you had to deal with a difficult customer/supplier. What did you learn from the engagement and what would you do differently with the benefit of your experience?
  • Bias for Action
    • Note where and if the candidate: Is able to evaluate facts, data and information effectively | Draws logical conclusions decisively | Uses own initiative to drive forward | Strives for challenging goals | Prioritizes effectively | Responds quickly | Responds resourcefully | Takes calculated risks
  • What are the most important objectives in your current job? Why?
  • What is your greatest professional technical or non technical accomplishment and why?
  • How about your worst professional failure? What was the most important lesson learned?
  • Can you provide an example where you had to make an important business decision, how you went about deciding what to do, and what alternatives you considered?
  • Tell me about a time when you have been faced with a challenge where the best way forward or strategy to adopt was not “clear cut” (i.e. there were a number of possible solutions). How did you decide the best way forward?
  • Can you give me a specific example of when you have thought through a number of options to a particular problem? (How did you arrive at your preferred solution?)
  • Can you provide an example where you had to learn something very quickly in order to address an urgent situation?

In-house Interview


Focus here will be on “dive deep” knowledge plus approach to systems monitoring and troubleshooting.

How do you debug a broken or non-functioning cron job?

Run either crontab -e, crontab -l, or check /var/spool/cron. Bonus points if they say to run the script manually, extra super bonus points if they check the crontab for any custom env settings, or run it under cron with all output saved to a file (instead of the default e-mail, which might be broken for other reasons).

Web services monitoring

If you were to set up a basic monitoring system for a simple 3 tier web application (frontend web, application, database) what are the key elements you would monitor? Below is a basic and far from exhaustive list… Feel free to delve deeper by asking questions like: What commands would you use to determine the information? How would you track or trend the information over time (graphing)? What graphing tools have you used in the past? How would you weed out false alarms or duplicates?

    • Per node Diskspace usage
    • Per node CPU usage
    • Per process CPU usage for critical processes
    • Per process memory usage for critical processes
    • Per node Swap Usage
    • Per node / cluster Disk I/O
    • Power/Cooling traps on Chassis/Servers
    • Per node/ interface NIC Throughput
    • Number of active TCP connections
    • Basic network connectivity between tiers
    • Application specific connectivity between tiers
    • Database Connections / Pooling
    • Monitor the number of active queries on the DB
    • Monitor the average response time of queries on the DB
    • External ping / http connectivity to frontend
    • External traceroute connectivity to frontend
    • Log parsing on webserver / app / database logs for strings like CRITICAL or ERROR
    • Etc…..

A server crashes and reboots in a remote datacenter. How do you determine what happened?

    • /var/log/messages mentions no shutdown.
    • last shows no one logged in at the time, nor does it show a reboot.
    • Disk space is not full.
    • dmesg returns no hardware errors at last boot time
    • No crash dump files.
    • Application logs show no issues.
    • fsck reveals no issues.
    • Server has a RAID array, did candidate check RAID diagnostics (no issues found)?
    • Did they check with the NOC?
    • NOC reports maintenance on the server rack in question, possible loose power cable.


  • Name 3 to 4 tools that are essential for troubleshooting when another host goes down. Ping and traceroute at least. Other good choices are arp, nslookup, dig, ifconfig, host, route, netstat, and mtr, perhaps tcpdump and other sniffers, depending. (mii-tool is a Linux-specific tool, good for showing the negotiated link speed; ifconfig on *BSD systems shows this data.)
  • What packets are sent to establish a TCP session or “handshake”? SYN, SYN+ACK, ACK
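  • What packets are sent to close a TCP session? FIN, ACK, FIN, ACK (the second and third are often combined, giving FIN, FIN+ACK, ACK)
  • What about UDP? UDP is connectionless, so there is no handshake or teardown, and delivery is not guaranteed.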
  • When would UDP be preferable to TCP? UDP is lightweight, and good for DNS lookups and such, where a dedicated connection isn’t needed or where some packet loss is acceptable, as in streaming media. TCP is better suited to persistent connections where a lot of data will be exchanged, such as SSH.
  • Where do the DNS client settings for a Unix host reside? Expect /etc/hosts, /etc/resolv.conf, /etc/nsswitch.conf (or similar lookup tool). See if they know the host, nslookup or dig commands.
  • How would you exclude SSH traffic from tcpdump output? If they know (or claim to know) tcpdump, see if they know the syntax “not port 22” (e.g. tcpdump not port 22) or something similar.
  • What are the differences between OSPF and RIP? RIP builds a routing table using hop counts (its metric) learned from neighbouring routers, while OSPF has routers advertising themselves and their adjacent routers so that each router builds a map of the whole topology. OSPF scales better, converges faster and avoids the routing loops (“count to infinity”) that RIP is prone to, while RIP is simpler to configure and run. In short, RIP is a distance vector routing protocol, OSPF is a link state routing protocol.
  • Mention/Describe a few loadbalancing algorithms:
    • Round-robin
    • Least-connection (default for load-balancers)
    • Weighted round-robin
    • Weighted least-connection
    • Bandwidth and load-based.
    • Vendor specific load-balancing implementations OK too.
  • How do you identify the start of a connection with packet sniffing tools? Expect TCP 3-way handshake(SYN, SYN+ACK, ACK), TCP sequence numbers.
  • Debug a DNS lookup problem: Fully qualified DNS names work, unqualified names time out. Why? (Root cause is usually that recursive DNS lookup is broken: fully qualified names go directly to the authoritative source, while recursive lookups wander up to the root name servers and then back down again.) You really need the dig utility to see what is going on here, assuming no access to the DNS servers.
  • Open-ended discussion started with “What is ping?” question.
    • How does ping work?
    • What else is ICMP used for?
    • Which ICMP packets should be blocked at firewall? (REDIRECT)
    • Which ICMP packets should never be blocked at firewall (DEST_UNREACH, TIME_EXCEEDED)
    • What happens if all ICMP is blocked at the firewall? (TCP path MTU discovery doesn’t work; clients are unable to download files from your server if they have an MTU less than 1500)
    • Do you know any other network diagnostic tools? (traceroute, mtr, tcpdump, wireshark)
    • How does traceroute work?
    • What kind of packets does traceroute use by default?
    • Why does traceroute use UDP by default, not ICMP?
    • What does it mean if some hops are not shown by traceroute?
    • How would you measure the network roundtrip time to some specific host?
    • Why are UDP and ICMP better than TCP for this?
    • If using ICMP echo requests or UDP, if the request or response packet gets lost, you don’t get a network roundtrip time and try to measure it again.
    • If a packet gets lost in TCP, you still get a response, but it will be delayed by the retransmission, and instead of the network roundtrip time you get some nonsense, with no indication that it’s nonsense.

The Soft Stuff

  • What’s the biggest Ops crisis you’ve ever faced?

Need to create some sort of operational scenario test here. Positive attributes:

  • Care to make sure customers are not impacted? (take hosts out of service first)
  • Experimentation on a small scale before making a large-scale change
  • Always have an “undo” plan (take backups first)
  • Communication to potentially affected parties before making changes
  • Review from peers

Negative attributes

  • “Ask my manager first to make sure it’s okay”
  • Make knee-jerk reactions and solutions
  • What are your career aspirations? How do you see your career developing over the next few years?
  • Why are you leaving your current position?
  • Why do you want to work for EC2 OpsEng?
  • How do you deal with a situation where you’re asked to perform a task and you don’t know how to do it? Communicate!! Inform the requestor so you manage expectations
  • How do you keep your manager informed about what is being done in your work area?
  • If you were interviewing for this position what would you be looking for in the applicants?
  • What has been your biggest contribution at your most recent job?
  • Tell me about a time when you had to resolve a difficult technical problem
  • Tell me about a time when you had to take a decision that was not your responsibility to take

Whiteboard code assignments / scripting

Small scripts

  • given 2 files with a 1-per-line integer in sorted uniquified order, produce a new file containing the union of the 2 input files.
  • given 2 files find the lines that occur in both.
    • Possible answer
    • build a hash from one file then count how many times seen in other file
    • what if the files are too big to fit into memory?
    • build a partial hash from 1/x of the first file then iterate over the other file
    • then build a hash from 2/x then iterate over the other file

IP in network

  • Presentation
  • Given an IP address, say, and a CIDR network, say,, write me a program or function, in any language you like, that will return true iff the IP is within the network, and false otherwise. (Or whatever the language calls true and false).
  • Goal / Competencies
  • Non-shallow networking knowledge: what *is* an IP address? A network? How are they defined?
  • Bit operations: the most straightforward way to solve this is using bit operations, or the arithmetic equivalents.
  • Sensible Scripting: there are many opportunities to reuse code (IP and network must be converted, it’s the same operation to convert each octet), refactor, choose a clear or unclear way of doing things, and lots of places to do input validation.
  • Bad signs
  • Regular expressions, except for splitting.
  • No code reuse/iteration — converting each octet into a dedicated variable, and then munging them together into another.
  • Trying to solve it via string comparisons
  • Trying to solve it via numerical comparisons of each octet, and never saying “there must be an easier way to do this”.
  • Regular expressions.
  • Good signs
  • Knows an IP is a 32 bit int.
  • Knows an IP is potentially a 128 bit int (IPv6).
  • Can convert the dotted quad to same, or figures out method.
  • If they can’t solve off the bat, works it out methodically (expands net & ip to binary, looks for pattern, etc.) I’ve had a candidate who had no idea how this was done internally, but once the integer equivalence was explained, worked it out from first principles inside 15 minutes.
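
For reference, a minimal sketch of the bit-operation approach, IPv4 only. The function names are illustrative, and the addresses in the usage note are made-up examples, since the prompt above leaves its own unspecified.

```python
def ip_to_int(dotted):
    """Convert a dotted-quad IPv4 string to a 32-bit integer, validating each octet."""
    octets = dotted.split(".")
    if len(octets) != 4:
        raise ValueError("expected four octets: %r" % dotted)
    value = 0
    for octet in octets:
        number = int(octet)               # raises ValueError on non-numeric input
        if not 0 <= number <= 255:
            raise ValueError("octet out of range: %r" % dotted)
        value = (value << 8) | number     # same operation reused for every octet
    return value

def ip_in_network(ip, cidr):
    """Return True if `ip` falls inside the CIDR block `cidr` (e.g. "a.b.c.d/n")."""
    network, prefix = cidr.split("/")
    prefix = int(prefix)
    if not 0 <= prefix <= 32:
        raise ValueError("prefix out of range: %r" % cidr)
    # Top `prefix` bits set, remaining host bits clear.
    mask = (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF
    return (ip_to_int(ip) & mask) == (ip_to_int(network) & mask)
```

For example, `ip_in_network("192.168.1.77", "192.168.1.0/24")` is true, while `ip_in_network("192.168.2.1", "192.168.1.0/24")` is false. The same conversion routine serves both the address and the network, which is the code reuse noted above.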

Advanced log slicing

  • Presentation
We have a log format made up of records separated by '-'*50 (a line of 50 dashes). Each record holds key-value pairs, one per line, and represents a request to a web site. It contains all the HTTP headers, and custom values such as the time to render a page.
Say an internal customer, a backend service provider, comes to you and says “I have these weird traffic patterns to my service, spiking every hour. If I give you a list of the URLs that use my service, can you tell me if there’s anything odd about the traffic to the site?”.
Two things; what would be useful data to gather here, and write a script (quick and dirty is ok to start, then refining) to gather it.
EXTRA CREDIT: OK, what’s the time complexity of this? How does it scale?
  • Goals/competencies
  • Traffic analysis/HTTP. What are the useful headers? What might this be? User agent, referrer, source address, all are a good start — this will of course be coloured by the last such problem the candidate worked on :)
  • Slightly advanced command line — grep -C/-B/-A makes this a lot easier, as do other somewhat under-advertised command line tools and options
  • Data structures — if the candidate gets to the stage of making this production and dealing with edge cases, then they need to represent the records somehow. All scripting languages have an easy to use dictionary data type, which maps well here (if you’ll pardon the pun).
  • Complexity/performance — can they figure out how their solution performs and speed it up?
  • Bad signs
  • They can’t solve it. At all. Just up and give up. Happens more often than you’d hope.
  • They don’t think about it, and try moving random greps around until the solution appears.
  • No ideas what’s useful information to gather, even if it’s just “well, is it a robot, who is making the request?”.
  • No idea how their less-hacky version scales/performs, and the answer is ‘badly’. I’ve had candidates manage to write n^2 code for just reading the file into memory.
  • Try to load the whole file into memory AND can’t think of a way around this when you say “the file is size-of-ram+1 bytes in size”
  • Good signs
  • They know what are common tell-tales of screen scrapers, unidentified robots, etc.
  • Their quick hacky version is written quickly, and is good enough. (Assume these lines are always within 5 lines of the url, so we can do grep -C5…)
  • They can write a first pass at a correct tool (no such proximity/order assumptions, doesn’t read entire file into memory, can pull the data into a useful structure) within the interview time. Handwaving on Getopt format or things of that nature is fine.
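
A rough sketch of what such a first pass might look like. The record separator comes from the description above; the `": "` key/value delimiter and the `url` and `User-Agent` key names are assumptions made purely for illustration.

```python
from collections import Counter

SEPARATOR = "-" * 50

def records(lines, delimiter=": "):
    """Yield one dict per record, streaming so the whole file is never held in memory."""
    current = {}
    for line in lines:
        line = line.rstrip("\n")
        if line == SEPARATOR:
            if current:
                yield current
            current = {}
        elif delimiter in line:
            key, value = line.split(delimiter, 1)   # assumed "key: value" layout
            current[key] = value
    if current:                                      # final record without a trailing separator
        yield current

def count_field(lines, urls, field, url_key="url"):
    """Count values of `field` across records whose URL is in the set `urls`."""
    counts = Counter()
    for record in records(lines):
        if record.get(url_key) in urls:
            counts[record.get(field, "")] += 1
    return counts
```

This is a single pass over the file with a constant-time set lookup per record, so it scales linearly with log size; memory grows only with the number of distinct values counted.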



Application Development Expertise

What programming or scripting language are you the most familiar with (what other languages do you have experience with)?

What change management, software version control, or deployment systems (or processes or methodologies) do you have experience with?

What automated-build tools or processes have you used?

Network Expertise

Please describe how a load balancer functions; what load balancing algorithms might you want to use and under what circumstances, what application characteristics make it difficult to load balance traffic (e.g. stateful applications)?

What is the difference between bandwidth and latency?

Describe how bandwidth and latency might come into play when diagnosing a potential network performance problem?

What is the difference between TCP and UDP?

What is a WAN optimizer and how do they typically work?

Storage Expertise

What is the difference between NAS and SAN storage?

Which one of these kinds of storage lends itself to the Cloud and why?

What are IOPS (how are they typically measured)?

What is RAID (do you know any different RAID types)?

Infrastructure Expertise

What is a hypervisor?

How does one typically go about deploying a virtualized (VMware, Hyper-V, OpenStack) infrastructure?

Please describe a typical LAMP, Java or .Net application stack?

Database Expertise

What are the typical constraints of a Relational Database System?

How would you recommend increasing the availability of an RDBMS (please describe how DB mirroring or log shipping works)?

How would you recommend scaling a RDBMS?

What are NoSQL databases?

Big Data/Analytics

Do you have any experience with or knowledge of Hadoop (how does Map Reduce help analyze large amounts of data)?

What are the differences between typical data warehouse workloads vs. traditional relational database workloads?

How do data warehouses typically structure their data to account for data warehouse-type workloads (e.g. columnar data store)?


What is the difference between symmetric and asymmetric cryptography?

What are some transport encryption protocols and how do they work?

What are some ways to encrypt data-at-rest?

What recommendations would you make to a customer about using the Cloud securely?

Content Delivery

What is a CDN (or Content Delivery Network) and how does it work?

Do you know how CDN POPs work (how a user is directed to a particular POP)?

Do you know the difference between a cache hit and a cache miss?

What are the advantages/disadvantages of using a CDN?

Performance Troubleshooting/Tuning

If a customer called up and said their application ran unbearably slow only on our platform, how would you start troubleshooting?

What are some ways a database/web server/app server can be tuned?

What are some performance and monitoring tools that you are familiar with?

What are some techniques for scaling compute? Storage?

Do you have any questions for me?





Time: 4 minutes

  • Q: User in Seattle is complaining of slow responses

Time: 4 minutes

  • Q: Walk me through HTTP or DNS

In both cases we’re looking for understanding of the protocols. Ask about a time when knowing how these protocols work influenced their actions, or when they used that knowledge to solve a problem.


Time: 5 minutes

  • Q: What are 5 metrics you would want to monitor on a Linux box?

Key here is the follow up. Specifically ask about:

  • Why those metrics? What is important about them? Try to understand what the motivation was.
  • Dive into 2 or 3 of the metrics. For example:
    • CPU: What about the CPU? What about context switches, what is system time, why might I care? (etc)
    • Network: What about it? Packet drops vs memory used vs utilization
    • Disk: What about it? Is this space or IOPs? inodes vs free space, open file handles vs reported free space. How do you monitor which programs are using IOPs?

( these cribbed from BigBird/Operations/Hiring )

Time: 3 minutes

What is a system call?
– A call to the kernel
– Access to a resource controlled by the kernel
– Used to separate user space and kernel space

What is a system call you used recently and why did you use it?
– read, open, write, close on files are the most common here
– GetHostByName, GetHostByAddr are other common answers (strictly these are library functions, not system calls; a good follow-up is to ask which system calls they end up making)
– interested in recording the why to see if the candidate understands what a system call is.
Why do we have System Calls?
– As above, it’s to gain access to resources owned by the kernel
– The good answer here is that the kernel is a gatekeeper for access to physical devices: if multiple processes were to access a device at the same time, it’s easy to crash the system. The system call provides a mechanism to control access to constrained resources, primarily hardware.
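
To make the idea concrete, here is a small sketch using Python’s `os` module, whose low-level file functions are thin wrappers over the corresponding system calls. The helper name `roundtrip` is hypothetical.

```python
import os
import tempfile

def roundtrip(payload):
    """Write bytes to a temp file and read them back via write(2), lseek(2), read(2), close(2)."""
    fd, path = tempfile.mkstemp()         # opens the file and returns a raw file descriptor
    try:
        os.write(fd, payload)             # the kernel performs the actual device I/O
        os.lseek(fd, 0, os.SEEK_SET)      # rewind to the start of the file
        return os.read(fd, len(payload))
    finally:
        os.close(fd)
        os.unlink(path)

echoed = roundtrip(b"hello kernel")
```

Every step that touches the file goes through a file descriptor handed out by the kernel; the process never addresses the disk directly.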

How do I check the system calls being made by a process?
– using the strace command
What is a Kernel Module?
– Object file that contains code which, when loaded, extends the running kernel without a reboot; usually used to add device drivers.
How would I change the kernel module for my RAID card?
– lsmod to see what’s loaded
– modprobe -r or rmmod to remove it
– modprobe or insmod to insert it
– (modprobe is the newer way and resolves module dependencies; insmod / rmmod are older)
That will do it until I reboot; how do I make it persist?
– put the module in /lib/modules/$(uname -r) and run depmod -a so that modprobe can find it
( uname -r is the kernel version of the machine )


Just going to focus on deep dive and bias for action very quickly.

Time: 5 minutes

  • Q: Tell me about a time when you had to analyze facts quickly, define key issues, and respond immediately to a situation. What was the outcome?

Time: 5 minutes

  • Q: Tell me about a problem you had to solve that required in-depth thought and analysis? How did you know you were focusing on the right things?