Anthony J. Foiani
As of July 2024, I am primarily seeking either a hybrid role in the San Francisco Bay Area or
a fully remote position.
A printable, one-page summary of this resumé is available in A4 or US Letter format.
Introduction
To any role, I bring an unusual combination of experience and knowledge:
- As an SRE, I have experience across a wide range of company sizes, and across
multiple technology stacks. Compared with most of my SRE colleagues, I have a more classic
computer science and software development background; this hybridizes well with the more
common "admin" background, creating solutions with the best of both traditions. I've been
deeply involved in incident management processes (including creating and teaching classes)
and have often been the "governance, regulatory, and compliance" (GRC) point-of-contact.
- As a software developer, I bring wide and deep experience across programming languages,
operating systems, specific technologies, deployment scales, team and company sizes, and
development methodologies. This background has been particularly useful when interfacing with
heterogeneous systems, especially legacy systems.
In all my roles, I strive to improve understanding: presenting options to leadership, learning
from colleagues, mentoring new teammates, educating users, providing refined compliance
evidence, and clarifying/codifying processes. This is the best investment we can make for the
future, and it always pays off.
Philosophy
“... the most important function that software builders do for their clients is the
iterative extraction and refinement of the product requirements. ... in planning any software
activity, it is necessary to allow for an extensive iteration between the client and the
designer as part of the system definition.”
— Fred Brooks, The Mythical Man-Month
Communication is a critical component of any successful system; we must understand the problem
we're solving before we can build a solution.
I've been privileged to work for companies of many sizes, on projects ranging from
single-person to teams of hundreds. In every case, communication was vital: requirements,
constraints, stakeholders, plans, prototypes, reviews, implementation, maintenance, and
evolution.
In many of these cases, my ability to understand technical systems together with the human and
organizational side has allowed me to translate across multiple groups and dramatically reduce
misunderstanding.
Between users and implementors: Does this do what needs to be done? Is the usage clear? What
are the corner cases? Where do you think this might go in the future?
Between different teams of implementors: What are the interfaces? What really needs to be
exposed? What technologies are we assuming / preferring / avoiding? What platforms will be
used?
Within a single team: How can we structure this for simplicity? Can we generalize it? Can we
re-use existing tools? How do we document this, especially for maintenance?
Between individuals, there's mentoring and education. I've discovered that I'm good at this,
and I was surprised by how fulfilling I found this aspect of my roles.
Skills
Instruction and Consulting
I have a knack for extracting requirements from users and managers; an ability to help
different parties gain a shared understanding; the technical background to understand
systems deeply; and a passion for encouraging the best solution for each situation while
minimizing overall complexity.
- Mentoring:
  - New employees
  - Internal transfers
- Training:
  - Research & Development
  - Composition & Presentation
  - Training-the-Trainer
- Facilitating Communication:
  - Understanding multiple parties
  - Translating between them
- Promoting Best Solution:
  - Identify strengths and weaknesses of systems
  - Evaluate groups’ needs against those systems
  - Optimize and compromise to minimize complexity
- Extracting Requirements: user studies, similar systems, design patterns
Computer Languages
Expert: |
Skilled: |
Some Experience: |
- Python
- Terraform
- Regular Expressions
- Shell Scripting (+ awk, sed, etc.)
- C++ (+ Boost)
- C
- Makefiles
- Perl
- Javascript / Typescript
|
- Emacs-Lisp
- Go ("golang")
- Helm
- Java
- Ruby
- SQL (+ Optimization)
- Visual Basic for Applications (VBA)
|
|
Operating Systems and Platforms
Expert:
- Kubernetes ("k8s")
- Amazon Web Services (depth)
- Linux (1995+): userspace, kernel, and embedded
- Monitoring: DataDog, Grafana, proprietary
Skilled:
- Amazon Web Services (breadth)
- Containers / Virtual Machines
- U-Boot firmware loader
Amazon Web Services
Skilled:
- Compute: EKS, ECR, EC2, Lambda
- Networking: VPC, Security Groups, Subnets, Gateways, Load Balancers
- Data Stores: S3, RDS (Aurora, Multi-AZ, MySQL, PostgreSQL), Redis, OpenSearch
- Monitoring: CloudWatch, CloudTrail
- Web Serving: CloudFront, ACM
- Identity: AWS Identity Center (a.k.a. Single Sign-On)
Some Experience:
- ECS / Fargate
- CloudFormation
- Elastic Beanstalk
- Multi-Region Deployment
- ElastiCache (Redis)
- CodeBuild
Data Representation and Interchange
Expert:
- JSON
- YAML
- Email (SMTP, DKIM, SPF)
- Unicode (e.g., UTF-8)
- XML, HTML
- CSV, other ETL
Skilled:
- Compression Algorithms and Archive Formats
- Protocol Buffers
- Graphics: PNG, GIF, PBM, JPEG, SVG, PDF, PostScript
- Text Processing: LaTeX, nroff, RTF
- ASN.1
Digital Security
Skilled:
- Threat Modeling
- Identity Providers (SSO, SAML)
- PKI (Certificates, CA, X509)
- Hardware Tokens / U2F
- Secure Shell (SSH)
- OpenSSL (command-line and API)
- CMS
- ASN.1
Compliance
Skilled:
- Technical Controls
- Creating and improving processes
- Maintaining Certification (SOC 2, ISO 27001)
- Generating evidence for audits
Embedded Development
Expert:
- Linux Kernel customization
- Toolchains Creation and Use
- I2C Bus
- Flash Memory (NOR vs NAND, MTD / UBI etc)
- Realtime Constraints
- Hardware Interfaces
Skilled:
- Boot Loader
- Device Tree
- Serial Ports (RS-232, RS-485, etc)
- Oscilloscopes / Logic Analyzers
Programming Techniques
Expert:
- Test-Driven Development
- Refactoring
- POSIX Threads
- C++ RAII
- Profiling / Optimization (both low- and high-level)
Skilled:
- Design Patterns
- Pair Programming
- Input Fuzzing
- Packaging
- Disassembly / Reverse Engineering
- Google RPC
Source Code / Configuration Management
Expert: |
Skilled: |
Some Experience: |
|
|
- Mercurial (hg)
- Perforce (g4)
|
Networking Protocols
Expert: |
Skilled: |
|
- IP (TCP, UDP)
- BSD Sockets API
- HTTP
- Telnet
- FTP
|
- FTP
- NTP
- SMTP
- SNMP
- SSH
- SSL/TLS
- TFTP
|
Google Internal Tools
Probably somewhat dated, as of 2024
Expert:
- Production Deployments: GCL/BCL, Borg (especially Dedicated Machines and SSD), GSLB,
cluster migration, capacity planning
- Monitoring: GMon, Monarch, Mash, Viceroy, BorgMon, Nebgua
- Data Storage: Colossus, Effingo, Placer, BigTable, Piper
- Data Analysis: GoogleSQL (Dremel, F1), gqui
- Search Technology: SuperRoot, Laelaps, Muppet, Raffia, Union, ST-BTI, FBM
Skilled:
- KeyStore
- Piccolo
- Spanner (especially Spanner Queues / Manifold)
Specialties
Skilled:
- Nuclear Safeguards Instrumentation (Neutron Counting hardware and software)
- Open Source licensing
- Journeyman-level Electronics
Some Experience:
- Computer Graphics
- Basic competency in German
Ancient Skills
Many older skills have been moved to another document.
Experience
Remote; January 2024 — July 2024 (6 months)
Developer Enablement
Responsible for all Observability, Incident Management, Service Catalog, and SLOs.
- Improved and extended incident reporting / analysis.
- Educated non-SREs (and in some cases non-developers) on metrics, SLOs, and other
observability topics.
- Helped clarify charter of new group.
SRE North
Interim assignment to a team comprising various ex-embedded SREs and some new hires.
- Assisted where external / generic skills were helpful (Terraform, SQL tuning, etc).
San Francisco, California, USA; August 2022 — October 2023 (1 year, 2 months)
Cloud Operations
Responsible for all infrastructure, including multiple production AWS accounts / EKS clusters,
as well as staging clusters. Reverse-engineered a complicated setup that had evolved over
time and whose original authors had mostly departed, with an eye to updating it to modern practices.
- Assisted product version upgrades / releases
- Upgraded multiple EKS clusters
- Spearheaded our IMDSv1 to IMDSv2 migration
- Managed our Sendgrid email configuration (DKIM, SPF, DMARC)
- Managed, evolved, and streamlined our solution for provisioning custom domains for
customers
Education
Helped teammates across the organization understand the benefits and limitations of our
platform. Collaborated to obtain solutions that were secure, compliant, effective, and
efficient.
- Promoted uniform monitoring so all engineers could see how well existing systems were
running
- Consulted with multiple other teams to provide secure and compliant solutions for specific
needs.
Compliance
Enabled our compliance team to achieve and maintain SOC 2 and ISO 27001
certifications. We also maintained a clean separation for data which fell under the GDPR.
- Wrote custom scripts to probe the boundaries of a complicated deployment
- Acted as security point-of-contact on the CloudOps team
Production Access: AWS SSO
Most of the effort involved in the AWS SSO Migration was ensuring that any new solution
satisfied existing access requirements. Given that the environment setup was legacy and
under-documented, this was a substantial challenge.
- Worked closely with our Corp IT team to use our Okta instance for authentication
- Iterated to ensure the SSO roles were sufficient but not overly broad
- Migrated all ad hoc users to AWS SSO for all AWS accounts
Mountain View, California, USA; November 2020 — June 2022 (1 year, 7 months)
Production Engineering
At the time of my departure, Production Engineering was still
the only 24×7 rotation within the company; while daytime
alerts were directed at more specific teams, we were the only
ones available outside business hours.
- Was primary oncall for 24×7 week-long rotations with a
5-minute time-to-action SLO
- Handled incident response (ongoing management, postmortems, reviews)
- Triaged (and sometimes handled) miscellaneous requests from other users ("interrupts")
- Updated documentation (playbooks, checklists)
- Mentored teammates as they went oncall
- Ran "table top" oncall exercises, complete with postmortem writeups
Compliance
Airtable maintained SOC 2 and ISO 27001
certifications. Keeping these certifications required regular
work; some scheduled (e.g., review who has access to which
systems every quarter), and some on demand (evidence gathering,
security patching).
- Was technical contact on the Prod Eng team for Compliance team
- Performed and streamlined Quarterly Access Reviews
- Provided evidence for SOC 2 and ISO 27001 audits
- Helped remediate security concerns
Production Access Onboarding / Mentoring
Airtable restricted access to sensitive environments to a small
number of engineers. This access required a separate laptop and
specific Security Team approval; coordinating that process for
dozens of users required documenting, revising, and finally
optimizing the steps required. (This was especially true as the
duties that used to be on a single team were spread out to
almost a dozen.)
- Enabled dozens of users to have "full production access", including training
- Evolved the onboarding process: optimization, documentation
- Handled tool evolution
Mountain View, California, USA; October 2013 — July 2020 (6 years, 9 months)
YouTube Trust & Safety SRE
Original Tech Lead for the Site Reliability Engineer (SRE) team formed to
support and productionize the YouTube Trust & Safety tools (for managing
abuse, fraud, child safety, etc).
- Founding member of a new team of 2 SREs, which grew to 5 within a year:
- Created infrastructure (permissions, mailing lists, group memberships, etc)
- Established team culture
- Initiated regular meetings with product developers
- Reviewed incident postmortems alongside product developers
and management, and helped clear a backlog of prior incidents
- Investigated multiple aspects of existing
developer-supported systems, then presented that knowledge to
multiple groups as well as to the new SREs
- Explored multiple options for consolidating and hosting
those systems (investigation and initial feasibility)
- Assisted emergency response to the emerging COVID-19 situation
YouTube Search & Discovery SRE
First member of the SRE team dedicated to managing YouTube's
“content discovery” systems: Search,
Personalization, Watch Next, Recommendations, etc.
- Worked closely with developers to deploy a custom instance of
the Google WebSearch technology stack for YouTube content
- Optimized that search stack by using more advanced container /
cluster features (saved ~20% out of O(1M) CPU cores)
- Senior member of a 24×7 oncall rotation with a 5-minute
response SLA
- Mentored 10+ new/junior SREs to full solo oncall capability
- Managed services deployed globally on millions of CPU cores:
cluster migrations, organic growth, feature launches, multiple
releases per week
- Handled multiple large public-facing incidents, including
postmortem creation, analysis, and followup
- Worked closely with our sister team in Zürich,
Switzerland (multiple in-person trips, weekly video
conferences, daily status handoff emails)
- Co-owned responsibility for our “panic room”
(providing privileged access to production networks in case of
an on-corp / in-office network outage)
Internal Consulting, Educating, Mentoring, and Interviewing
This wasn't a distinct role; instead, it calls out the areas
where I specialized and provided extra value to my teams.
- Mentored many peers on a 1-to-1 basis:
- Brought 10+ SREs to full solo oncall ability
- Maintained a list of resources for new SREs
- Supported other oncallers during their shifts:
- Was often the designated contact person for new SREs
during their first few solo shifts
- Was the YouTube SRE group expert in multiple technologies
(e.g., Search, Laelaps, GoogleSQL).
- As a senior member of the overall team, often helped
manage and resolve massive incidents involving multiple
shards of YouTube SRE.
Developed, presented, and trained others to present courses on topics including:
- GCL — a custom configuration language with highly unusual semantics
- GSLB — Google's global load balancer, which routes trillions of requests per second
- ST-BTI — an obsolescent storage and indexing solution which a team wasn't yet able to migrate away from
- YouTube Search — a custom instance of the Google WebSearch stack
- “Going On-Call for Developer Rotations” —
introduced hundreds of developers to the principles of going oncall
-
Helped fellow Googlers across the company with questions in my fields of expertise:
-
GCL — the most widely used configuration language
within Google
-
BigQuery / GoogleSQL — Google's implementation of
standard SQL on top of petabytes of protocol buffers
(and gqui, an ad hoc query engine for
those same files)
-
Production Management — especially with rarely-used
edge cases at the intersection of virtual and physical
machines
- Regularly received “peer bonuses” for this
assistance (being beyond my regular job duties)
- Interviewed 50+ candidates:
- Specifically volunteered for interviews with candidates
from more diverse backgrounds (and received a “peer
bonus” for using inclusive language in my
feedback)
- Became a “calibrated” interviewer (my scores
were generally aligned with other interviewers' and
ultimately with those of the hiring committees)
- Helped build and maintain team culture and cohesiveness:
- Researched and initiated inclusive events
- Planned and ran multiple off-site activities
- Took photographs to share with the team, organization, and company
- Took photographs to share with the team, organization, and company
Albuquerque, New Mexico; July 2009 — October 2013 (4 years, 3 months)
Universal Non-Destructive Assay Platform: Software Architect / Implementor
-
Worked with a multidisciplinary, international group including:
- Adapted Linux and supporting libraries to custom embedded processor
- Designed and built custom software for realtime data acquisition
- Provided high security data transfer and device configuration
- Wrote and generated in-depth API / extension documentation
- Advised a team new to Linux and many other
current technologies (XML, HTTP, TLS, NTP, etc).
- Integrated many technologies while creating a long-running
unattended data acquisition system, including:
- Busses: I²C, USB, PCI
- Security: OpenSSL+OpenSC+PKCS11, tamper sensors
- Web-Based UI: HTML, JavaScript, AJAX, CSS
- Real-Time Processing: Threads, Watchdogs, Optimizations
- Worked with electrical engineers and digital designers
- Provided project administration: version control, builds, and
release management
San Diego, California; October 2004 — July 2009 (4 years, 7 months)
Worldwide Data Distribution System: Architect / Implementor
- Adapted existing system for serving data to hundreds of
thousands of nodes around the world
- Documented a substantial corpus of existing code
- Created a system for publishing that documentation to company standards
Mobile Entertainment Provisioning: Architect / Implementor
- Put Yahoo! Music onto mobile phones:
- Multi-tier architecture (J2EE, Tomcat, AXIS, AJP)
- Multi-client presentation (WAP, XHTML, JSP)
- Dealt with non-traditional (Mobile vs. PC) browser factors: memory,
display, latency, bandwidth
- Used custom packaging / deployment technology (similar to RPMs);
became the site expert on that technology (out of 70-100 engineers)
- Pushed technology envelope within a large company
- Early Linux (RHEL4) adopter
- Early J2EE (Apache/Tomcat/AXIS) adopter
Backoffice Data Reorganization: Manager / Architect / Implementor
- Re-organized some 60TiB of business-critical live data
- Coordinated 10 people doing various aspects of necessary work
- Wrote before/after comparison scripts to validate the operation
- Designed and wrote helper utilities to make the motion
transparent to client processes
Miscellaneous Knowledge Sharing
- Shared in-depth knowledge of Perl and Unix on company
mailing lists
- Facilitated an “add your own map” mashup on the Y! Maps site
- Implemented a fast graph search algorithm for a remote colleague
- Contributed to various open-source packages
MusicMatch.com
San Diego, California; November 2001 — acquired by Yahoo! in October 2004 (2 years, 11 months)
Digital Audio Processing Engineer
Inherited and extended a distributed audio processing system:
- Updated and enhanced a heterogeneous cluster of processing nodes
- Rewrote DSP core in C++
- Handled multiple standards (MP3, AAC, WMA; DRM / no-DRM; tagging)
- Managed 60+ terabytes of audio data and associated metadata
- Helped build rules-based metadata engine for popular and classical audio tracks
- Helped evaluate various encoding / processing schemes
Database Application Programmer
- Worked with Oracle (versions 8, 9, 10)
- Wrote, debugged, supported, and optimized DDL, DML, and bulk loading
- Implemented and supported a mod_perl-based administrative interface
(including DHTML features)
Streaming Digital Audio Engineer
- Generated weekly metadata builds providing streaming audio to 30M desktops
- Optimized legacy system to accommodate 100x original design capability
- Helped scale related subsystems
- Extended and optimized browser-based administration tools
Miscellaneous Knowledge Sharing
- Supported and extended existing systems
- Answered Unix / shell / Perl questions
- Optimized database queries
Previous Jobs
Older entries have been moved to the historical file.
Publications
The Perl Journal
Education
Bachelor of Science
Bachelor of Science in Computer Science and Math, with a
minor in German.
New Mexico State University
Las Cruces, New Mexico
Date of graduation: May 1995
GPA: 3.00 out of 4.00
Activities
Community Contributions
Groups
I follow and contribute answers to many lists, including:
Projects
I contribute answers and a few patches to many lists, including:
References
References are available upon request.