Anthony J. Foiani
As of July 2024, I am primarily seeking either a hybrid role in the San Francisco Bay Area or a fully remote position.
A printable, one-page summary of this resumé is available in A4 or US Letter format.
Introduction
To any role, I bring an unusual combination of experience and knowledge:
- As an SRE, I have experience across a wide range of company sizes, and across multiple technology stacks. Compared with most of my SRE colleagues, I have a more classic computer science and software development background; this hybridizes well with the more common "admin" background, creating solutions with the best of both traditions. I've been deeply involved in incident management processes (including creating and teaching classes) and have often been the "governance, regulatory, and compliance" (GRC) point-of-contact.
- As a software developer, I bring wide and deep experience across programming languages, operating systems, specific technologies, deployment scales, team and company sizes, and development methodologies. This background has been particularly useful when interfacing with heterogeneous systems, especially legacy systems.
In all my roles, I strive to improve understanding: presenting options to leadership, learning from colleagues, mentoring new teammates, educating users, providing refined compliance evidence, and clarifying/codifying processes. This is the best investment we can make for the future, and it always pays off.
Contact
Email: | anthony@foiani.com |
Physical: | 1543 Rainier Ave / Napa, California, USA / 94558 |
Phone: | +1 650 296 9563 |
Virtual: | Most common platforms (Zoom, Teams, Meet, FaceTime, Chime, etc.) |
Philosophy
“... the most important function that software builders do for their clients is the iterative extraction and refinement of the product requirements. ... in planning any software activity, it is necessary to allow for an extensive iteration between the client and the designer as part of the system definition.”
— Fred Brooks, The Mythical Man-Month
Communication is a critical component of any successful system; we must understand the problem we're solving before we can build a solution.
I've been privileged to work for companies of many sizes, on projects ranging from single-person to teams of hundreds. In every case, communication was vital: requirements, constraints, stakeholders, plans, prototypes, reviews, implementation, maintenance, and evolution.
In many of these cases, my ability to understand technical systems together with the human and organizational side has allowed me to translate across multiple groups and dramatically reduce misunderstanding.
Between users and implementors: Does this do what needs to be done? Is the usage clear? What are the corner cases? Where do you think this might go in the future?
Between different teams of implementors: What are the interfaces? What really needs to be exposed? What technologies are we assuming / preferring / avoiding? What platforms will be used?
Within a single team: How can we structure this for simplicity? Can we generalize it? Can we re-use existing tools? How do we document this, especially for maintenance?
Between individuals, there's mentoring and education. I've discovered that I'm good at this, and I was surprised by how fulfilling I found this aspect of my roles.
Skills
Instruction and Consulting
I have a knack for extracting requirements from users and managers; an ability to help different parties gain a shared understanding; the technical background to understand systems deeply; and a passion for encouraging the best solution for each situation while minimizing overall complexity.
Computer Languages
Expert: | Skilled: | Some Experience: |
---|---|---|
|
|
|
Operating Systems and Platforms
Expert: | Skilled: |
---|---|
|
|
Amazon Web Services
Skilled: | Some Experience: |
---|---|
|
|
Data Representation and Interchange
Expert: | Skilled: |
---|---|
|
|
Digital Security
Skilled: | |
---|---|
|
Compliance
Skilled: | |
---|---|
|
Embedded Development
Expert: | Skilled: |
---|---|
|
|
Programming Techniques
Expert: | Skilled: |
---|---|
|
|
Source Code / Configuration Management
Expert: | Skilled: | Some Experience: |
---|---|---|
|
|
|
Networking Protocols
Expert: | Skilled: | |
---|---|---|
|
|
Google Internal Tools
Probably somewhat dated, as of 2024
Expert: | Skilled: |
---|---|
|
|
Specialties
Skilled: | Some Experience: |
---|---|
|
|
Ancient Skills
Many older skills have been moved to another document.
Experience
Zapier
Remote; January 2024 — July 2024 (6 months)
Developer Enablement
Responsible for all Observability, Incident Management, Service Catalog, and SLOs.
- Improved and extended incident reporting / analysis.
- Educated non-SREs (and in some cases non-developers) on metrics, SLOs, and other observability topics.
- Helped clarify charter of new group.
SRE North
Interim assignment on a team comprising various ex-embedded SREs and some new hires.
- Assisted where external / generic skills were helpful (Terraform, SQL tuning, etc).
Firstup
San Francisco, California, USA; August 2022 — October 2023 (1 year, 2 months)
Cloud Operations
Responsible for all infrastructure, including multiple production AWS accounts / EKS clusters, as well as staging clusters. Reverse-engineered a complicated setup that had evolved over time, with most of the original authors departed, with an eye to updating to modern practices.
- Assisted product version upgrades / releases
- Upgraded multiple EKS clusters
- Spearheaded our IMDSv1 to IMDSv2 migration
- Managed our Sendgrid email configuration (DKIM, SPF, DMARC)
- Managed, evolved, and streamlined our solution for provisioning custom domains for customers
Education
Helped teammates across the organization understand the benefits and limitations of our platform. Collaborated to obtain solutions that were secure, compliant, effective, and efficient.
- Promoted uniform monitoring so all engineers could see how well existing systems were running
- Consulted with multiple other teams to provide secure and compliant solutions for specific needs.
Compliance
Enabled our compliance team to achieve and maintain SOC 2 and ISO 27001 certifications. We also maintained a clean separation for data which fell under the GDPR.
- Wrote custom scripts to probe the boundaries of a complicated deployment
- Acted as security point-of-contact on the CloudOps team
Production Access: AWS SSO
Most of the effort involved in the AWS SSO Migration was ensuring that any new solution satisfied existing access requirements. Given that the environment setup was legacy and under-documented, this was a substantial challenge.
- Worked closely with our Corp IT team to use our Okta instance for authentication
- Iterated to ensure the SSO roles were sufficient but not overly broad
- Migrated all ad hoc users to AWS SSO for all AWS accounts
Airtable
Mountain View, California, USA; November 2020 — June 2022 (1 year, 7 months)
Production Engineering
At the time of my departure, Production Engineering was still the only 24×7 rotation within the company; while daytime alerts were directed at more specific teams, we were the only ones available outside business hours.
- Was primary oncall for a 24×7 week-long rotation with a 5-minute time-to-action SLO
- Handled incident response (ongoing management, postmortems, reviews)
- Triaged (and sometimes handled) miscellaneous requests from other users ("interrupts")
- Updated documentation (playbooks, checklists)
- Mentored teammates as they went oncall
- Ran "table top" oncall exercises, complete with postmortem writeups
Compliance
Airtable maintained SOC 2 and ISO 27001 certifications. Keeping these certifications required regular work; some scheduled (e.g., review who has access to which systems every quarter), and some on demand (evidence gathering, security patching).
- Was technical contact on the Prod Eng team for Compliance team
- Performed and streamlined Quarterly Access Reviews
- Provided evidence for SOC 2 and ISO 27001 audits
- Helped remediate security concerns
Production Access Onboarding / Mentoring
Airtable restricted access to sensitive environments to a small number of engineers. This access required a separate laptop and specific Security Team approval; coordinating that process for dozens of users required documenting, revising, and finally optimizing the steps required. (This was especially true as the duties that used to be on a single team were spread out to almost a dozen.)
- Enabled dozens of users to have "full production access", including training
- Evolved the onboarding process: optimization, documentation
- Handled tool evolution
Google Inc
Mountain View, California, USA; October 2013 — July 2020 (6 years, 9 months)
YouTube Trust & Safety SRE
Original Tech Lead for the Site Reliability Engineer (SRE) team formed to support and productionize the YouTube Trust & Safety tools (for managing abuse, fraud, child safety, etc).
- Founding member of a new team of 2 SREs, which grew to 5 within a year:
- Created infrastructure (permissions, mailing lists, group memberships, etc)
- Established team culture
- Initiated regular meetings with product developers
- Reviewed incident postmortems alongside product developers and management, and helped clear a backlog of prior incidents
- Investigated multiple aspects of existing developer-supported systems, then presented that knowledge to multiple groups as well as to the new SREs
- Explored multiple options for consolidating and hosting those systems (investigation and initial feasibility)
- Assisted emergency response to the emerging COVID-19 situation
YouTube Search & Discovery SRE
First member of the SRE team dedicated to managing YouTube's “content discovery” systems: Search, Personalization, Watch Next, Recommendations, etc.
- Worked closely with developers to deploy a custom instance of the Google WebSearch technology stack for YouTube content
- Optimized that search stack by using more advanced container / cluster features (saved ~20% out of O(1M) CPU cores)
- Senior member of a 24×7 oncall rotation with a 5-minute response SLA
- Mentored 10+ new/junior SREs to full solo oncall capability
- Managed services deployed globally on millions of CPU cores: cluster migrations, organic growth, feature launches, multiple releases per week
- Handled multiple large public-facing incidents, including postmortem creation, analysis, and followup
- Worked closely with our sister team in Zürich, Switzerland (multiple in-person trips, weekly video conferences, daily status handoff emails)
- Co-owned responsibility for our “panic room” (providing privileged access to production networks in case of an on-corp / in-office network outage)
Internal Consulting, Educating, Mentoring, and Interviewing
This wasn't a distinct role; instead, it calls out the areas where I specialized and provided extra value to my teams.
- Mentored many peers on a 1-to-1 basis:
- Brought 10+ SREs to full solo oncall ability
- Maintained a list of resources for new SREs
- Supported other oncallers during their shifts:
- Was often the designated contact person for new SREs during their first few solo shifts
- Was the YouTube SRE group expert in multiple technologies (e.g., Search, Laelaps, GoogleSQL).
- As a senior member of the overall team, often helped manage and resolve massive incidents involving multiple shards of YouTube SRE.
- Developed, presented, and trained others to present courses on topics including:
- GCL — a custom configuration language with highly unusual semantics
- GSLB — Google's global load balancer, which routes trillions of requests per second
- ST-BTI — an obsolescent storage and indexing solution which a team wasn't yet able to migrate away from
- YouTube Search — a custom instance of the Google WebSearch stack
- “Going On-Call for Developer Rotations” — introduced hundreds of developers to the principles of going oncall
- Helped fellow Googlers across the company with questions in my fields of expertise:
- GCL — the most widely used configuration language within Google
- BigQuery / GoogleSQL — Google's implementation of standard SQL on top of petabytes of protocol buffers (and gqui, an ad hoc query engine for those same files)
- Production Management — especially with rarely-used edge cases at the intersection of virtual and physical machines
- Regularly received “peer bonuses” for this assistance (being beyond my regular job duties)
- Interviewed 50+ candidates:
- Specifically volunteered for interviews with candidates from more diverse backgrounds (and received a “peer bonus” for using inclusive language in my feedback)
- Became a “calibrated” interviewer (my scores were generally aligned with other interviewers' and ultimately with those of the hiring committees)
- Helped build and maintain team culture and cohesiveness:
- Researched and initiated inclusive events
- Planned and ran multiple off-site activities
- Took photographs to share with the team, organization, and company
Foiani LLC
Albuquerque, New Mexico; July 2009 — October 2013 (4 years, 3 months)
Universal Non-Destructive Assay Platform: Software Architect / Implementor
- Worked with a multidisciplinary, international group including:
- Adapted Linux and supporting libraries to custom embedded processor
- Designed and built custom software for realtime data acquisition
- Provided high security data transfer and device configuration
- Wrote and generated in-depth API / extension documentation
- Advised a team new to Linux and many other current technologies (XML, HTTP, TLS, NTP, etc).
- Integrated many technologies while creating a long-running unattended data acquisition system, including:
- Busses: I²C, USB, PCI
- Security: OpenSSL+OpenSC+PKCS11, tamper sensors
- Web-Based UI: HTML, JavaScript, AJAX, CSS
- Real-Time Processing: Threads, Watchdogs, Optimizations
- Worked with electrical engineers and digital designers
- Provided project administration: version control, builds, and release management
Yahoo! Inc.
San Diego, California; October 2004 — July 2009 (4 years, 9 months)
Worldwide Data Distribution System: Architect / Implementor
- Adapted existing system for serving data to hundreds of thousands of nodes around the world
- Documented a substantial corpus of existing code
- Created a system for publishing that documentation to company standards
Mobile Entertainment Provisioning: Architect / Implementor
- Put Yahoo! Music onto mobile phones:
- Multi-tier architecture (J2EE, Tomcat, AXIS, AJP)
- Multi-client presentation (WAP, XHTML, JSP)
- Dealt with non-traditional (Mobile vs. PC) browser factors: memory, display, latency, bandwidth
- Used custom packaging / deployment technology (similar to RPMs); became site expert on technology (out of 70-100 engineers)
- Pushed technology envelope within a large company
- Early Linux (RHEL4) adopter
- Early J2EE (Apache/Tomcat/AXIS) adopter
Backoffice Data Reorganization: Manager / Architect / Implementor
- Re-organized some 60TiB of business-critical live data
- Coordinated 10 people doing various aspects of necessary work
- Wrote before/after comparison scripts to validate the operation
- Designed and wrote helper utilities to make the motion transparent to client processes
Miscellaneous Knowledge Sharing
- Shared in-depth knowledge of Perl and Unix on company mailing lists
- Facilitated an “add your own map” mashup on the Y! Maps site
- Implemented a fast graph search algorithm for a remote colleague
- Contributed to various open-source packages
MusicMatch.com
San Diego, California; November 2001 — acquired by Yahoo! in October 2004 (2 years, 11 months)
Digital Audio Processing Engineer
Inherited and extended a distributed audio processing system:
- Updated and enhanced a heterogeneous cluster of processing nodes
- Rewrote DSP core in C++
- Handled multiple standards (MP3, AAC, WMA; DRM / no-DRM; tagging)
- Managed 60+ terabytes of audio data and associated metadata
- Helped build rules-based metadata engine for popular and classical audio tracks
- Helped evaluate various encoding / processing schemes
Database Application Programmer
- Worked with Oracle (versions 8, 9, 10)
- Wrote, debugged, supported, and optimized DDL, DML, and bulk loading
- Implemented and supported a mod_perl-based administrative interface (including DHTML features)
Streaming Digital Audio Engineer
- Generated weekly metadata builds providing streaming audio to 30M desktops
- Optimized legacy system to accommodate 100x original design capability
- Helped scale related subsystems
- Extended and optimized browser-based administration tools
Miscellaneous Knowledge Sharing
- Supported and extended existing systems
- Answered Unix / shell / Perl questions
- Optimized database queries
Previous Jobs
Older entries have been moved to the historical file.
Publications
The Perl Journal
- A Web Spider in One Line?, Fall 1999 (republished in The Best of The Perl Journal, Volume 2: Web, Graphics, & Perl/Tk Programming)
Education
Bachelor of Science
Bachelor of Science in Computer Science and Math, with a minor in German.
New Mexico State University
Las Cruces, New Mexico
Date of graduation: May 1995
GPA: 3.00 out of 4.00
Activities
Community Contributions
- My fairly new blog
- I have some repos on GitHub
- My ancient perl samples still get some traffic
Groups
I follow and contribute answers to many lists, including:
- Boulder Linux Users' Group
- Northern Colorado Linux Users' Group
- San Diego Perl Mongers
- Portland Perl Mongers
Projects
I contribute answers and a few patches to many lists, including:
- Crosstool-NG
- Boost C++ Libraries
- OpenSC (Smart Card support)
- Linux (Kernel, USB, PowerPC, Networking, Filesystems)
References
References are available upon request.