Releng 2014 – Keynote 1: Chuck Rossi, Release Engineering, Facebook Inc. | Talks at Google


BRAM ADAMS: Welcome, everybody,
on this relatively sunny day at the Releng 2014. So there will be lots of
people here, around 100. So this was a huge organization. So there’s quite some
people who worked on this. Not everybody
could make it here. Chris and Kim, unfortunately,
could not be here. But there are some other
people in the room, like Foutse, for example,
Stephany, she’s still, I think, outside, Boris
is just coming in. And then there’s Akos as well,
who’s probably also outside and coming in soon. And I’m Bram, Bram Adams. OK, cool. First of all, we’re
at Google right now. And this would not have been
possible without some people internally who have
helped us quite a lot. That’s one of them, Boris. And Akos is the other one. So I really would like to thank
them and people who are here, like Dominic and Eugene,
and so let’s thank them first for helping
us set this up. Cool, OK. Now, Releng. A while ago, I did my
yearly chocolate pilgrimage in Belgium. And I ended up in Brussels at
some conference called Fosdem, which is about open
source development. And I wandered the corridors. I could hardly pass. So these are lined up–
more people lining up to go in the room. And the room was full
and blocked by somebody in the Puppet Lab t-shirt–
no correlation there. And then I was trembling
with trepidation there, and then I took a picture,
which is blurry on purpose. And it shows “Full”. And this was a session on
configuration management. Now what was that? Well, this is a session about
all things release engineering, actually. They’re talking about
deployment clouds, enterprise configuration management. It’s full. And these are all people who want
to learn these open source technologies supporting that. And then I said, yes this makes
sense, because I saw this blog post a bit earlier by this
gentleman here, who said, well, you know, continuous
delivery is mainstream. And he got a lot of backlash
there, which is weird because we just saw all
these people lining up. They want to use
continuous delivery and all these fancy [INAUDIBLE]
technologies. And this backlash
was like an update. And his point was
actually that since 2010, people like Facebook,
Google, Amazon have been doing continuous
delivery– all these things. So we’re four years later,
so this should be mainstream right now. But he made some formulations
that caused some controversy. And later on, there
came this blog post that actually nailed
it down, exactly. And I want to zoom in exactly
on what this blog post said. What it says is
actually yes, people want to really apply
continuous delivery and all these
[INAUDIBLE] engineering techniques in their situation. And that’s where the problem is. Because how do you do that? It works at Google. It works at Facebook. They did lots of
effort through this, and made some mistakes
along the way. How does their work
apply to other companies? OK. For example, how
can you actually get buy-in for your management
to spend effort to get there? All these kind of
thing– other techniques you can use, what are the tools? So bottom line, you have a bunch
of people in the industry who want to apply
continuous delivery. Plus you have a whole bunch
of these guys– researchers who want to help you,
who want to prove that continuous delivery
helps, it improves quality. So what happens if
you combine both? Then you’re here, at Releng. That’s the goal of this
workshop– people talking about experiences, how they
can go to rapid release, and all these kind of things. We have researchers who want
to help, who show some results, want to get ideas
for further research. And that’s basically
why we’re here. Right here. OK, cool. So now the workshop. What will we see today? Well, we had lots of submissions. We didn’t get five, not
ten, not 15– 18 submissions, which is quite cool. Even cooler, especially
given what I just said, is that half of them are from
industry and half of them are from research, exactly. So from these 18, 16
will be presented today. You’ll be seeing them. And you can interact and discuss
and these kind of things. And now, we had a
very busy schedule. So we had one week for
people to actually review all these submissions. So some people really made a
huge effort here in one week. They did this thing and
even discuss things online. So really let’s thank these
people for doing this work. BORIS DEBIC: All
right, everybody I want to welcome
you again, and I want to welcome our first
speaker today, Mr. Charles Chuck Rossi, of Facebook. I know him– yes,
give him a hand. Give him a hand. He is already a release
engineering celebrity of sorts in Silicon
Valley, for many reasons that I won’t
go into here. But I know Chuck from Google. He used to work at Google. In 2008, Chuck joined
Facebook, and he’s been working in the release
engineering group at Facebook ever since. As you probably know, Facebook’s
application on iOS and Android is the most popular
application on the planet. And those guys who
work with Charles, they release this twice a day. So he has a lot of experience
in how to push change lists and bugs very fast
into production. I would like him to
share with this group some of the stories
and war stories and some of the
approaches that they take at Facebook to
make this a success. Chuck, please. CHUCK ROSSI: Thank you, Boris. I hired Boris at Google. I learned a lot about
hiring after that. Yeah, sorry for that. So I want to talk
a little bit mainly about what we’re doing
lately at Facebook, and it’s all about mobile. I’ve talked a lot about
the front end release process, the facebook.com
release process, which Boris says we’re a little bit
famous for because we do it twice a day, every day. Facebook.com rolls out new
code every day, twice a day. I’ll give you
inside information. It’s around 8:00 am
Pacific and around 4:00 pm Pacific is when the
whole site rolls. Anywhere between 30 and 300
cherry-picks go out per roll. It’s a quasi-continuous
deployment, with well over 1,000
engineers touching it with over a billion people
being affected every time we push that button. Some of my release
engineers are in the back. They’re a little
nervous because it’s about time we should
be rolling it. Hopefully, everything’s good. If you see Facebook go down,
somebody wave their hands like, it’s not working. Let me know. The genesis of this talk
and the talks I give really came– I got to give credit to
John Allspaw and Paul Hammond. That talk, I think we’ve
all seen it, from the Velocity conference from–
when was it– 2009, where they defined
the dev/ops thing. And we had been doing this
organically at Facebook since I got there in 2008. And it gave me a
voice and name to call this thing and a
sledgehammer I could use for the developers
coming in saying, this is how we do stuff. It’s been validated
by these guys. This is the way we’re doing it. The thing I got from
that talk– what’s the take away from that talk? This slide. Right? “No.” We’re all going to look like
that as release engineers. That pretty much
summed up my experience up to that point of
being a release engineer and saying this is I operate. But it changed. And for mobile,
it changed a lot. So the dev/ops movement and
everything we learned at web– I consider web delivery
a solved problem. For Facebook, it is
a solved problem. We’ve whittled
down the team that supports pushing those
300 changes a day down to effectively two people. Two people run facebook.com
from a release engineering point of view. Then we decided, OK,
we’re a mobile company. And this became a problem,
because we threw out everything. All the good stuff
that we had learned, all the good things that
we had from so many years of building up to a continuous
delivery system and all this dev/ops crap was great. And then mobile came
along, and it’s dumb. It threw away everything. We had to start again
mostly on the culture side and on the thinking side. And this was unfortunate. Now, what we’re
dealing with here– and Boris alluded to this–
is a scale that is big. On iOS, we’re the
number one app. And we are number
one because there are some percentage
of a billion people who run that app on their phone. I can’t tell you the
exact split because I don’t want to hurt anyone’s feelings. Those are monthly active users. Think about this–
over 700 million people will use the app today, alone. There’s about 300 plus
engineers working on it. There are many features. I will make this case here. I defy you to find a more
complex app on the platform than the Facebook app. I don’t think there is an
application as complex that uses the full stack of the
phone as Facebook does. We got to support multiple
devices, even on iOS. And remember, there
is a web component to most mobile heavyweight apps. We are delivering web and
backend endpoints and web endpoints to deliver content
and experience to the phone. That’s iOS. Same kind of story on Android. It’s the number
one non-Google app. The number one app people
choose to install is Facebook. Again, it’s some percentage
of the billion monthly and the 700 million daily users. Again, about 300 engineers
and again, the same problem. The multiple devices problem is
a bit more severe with Android. And I’ll get into that. And again, web component,
you got to worry about. So fundamentally, the
thing that gets us is if we have problems
on the website, if you have a fatal
on facebook.com it looks like this. So you can’t see, but there
is a fatal on that page. Something didn’t render. And that’s a php bug,
and I got to fix that. That’s a fatal. I got to fix it right now. It’s a rendering problem. But you pretty much
have an experience. If you crash on any mobile
thing, what’s it look like? Boom, you’re out. You’re done. Your user experience is over. And you can crash for any
number of reasons on mobile. If you develop on mobile,
you know this very well. And it’s miserable. And people hate mobile. The user satisfaction numbers
of web versus mobile, mobile’s in the toilet,
because the experience is A, out of our
control many times. And B, we can’t recover,
do exception handling, or gracefully exit
when things go wrong. So we’re under much
more scrutiny on mobile. So what we want to do,
though, with our release process– the main thing
is as release engineers, we are here to make
the company successful. We have to maximize the rate
at which our company can do great things– all of us. Our companies want
to do awesome things. Our developers want to
ship their cool code. We are there to facilitate that. We have to make it happen. At the same time,
we’re also responsible. We’re the adult supervision. There has to be some
sort of quality metric, some supervision, some idea
of, are things better or worse if I push this button? And we are all pushing
that button to say, I say this is
going to be better. We have key metrics on
mobile that cannot regress. TTI– Time To Interaction,
crash rate, star rating, things like that
cannot go backwards. And that’s important to us,
and as release engineers we pay attention to that. So mobile is different. Let me talk more about some
of the things that bite us in mobile that you’re
not used to on web. There are no daily releases. Those of us who, if you want
to release your packaged software– as I said,
I was at VMware, and we released stuff on a decent
schedule as packaged software. It was up to us when we released. Web, we release
continuously, right? What do we do on mobile? Nothing. Pick up any iPhone in
this room, go home screen, and look at the home screen
at that stupid app store icon, there will be a double digit
red number in that box. Why do I, like a monkey, got to
go push that button every day? To get my little thing
back, and say, OK, do that, do that– mindlessly
pushing that button to tell it to update. The worst thing in
the world, especially as a release engineer, because
you have no control when you’re going to
push that button. My mother’s is probably
a three-digit number in that thing. So this is a major, major crisis
for you as a release engineer. Now, iOS 7 got it right in that
you can turn on auto update. They didn’t make
it on by default, which I think was a mistake,
but maybe the next release will sneak that on. We have to get away from this. Android has a long way to
go to make this better. We release every
four weeks on mobile. And that is fast
for a mobile company with hundreds and
hundreds of developers working on a on
a billion people. I’ll get into details of
what that flow looks like. But we can talk about what
your release schedules look like for your mobile
apps, but I think four weeks is relatively
quick for mobile. The other problem–
when we release software as release engineers, do we
build our bundle, our website, or web stack, whatever it is– push
a button, and 100% of everybody gets that? When I push the button
at facebook.com, do 1.25 billion people
instantly get my new binary? No. I do a slow rollout. I push the web to 2%. I get data, looks good. I push out the rest. In mobile, what do I do? I push a button. It goes to the app store,
the black hole that is Apple. And out comes, in some
indeterminate amount of time, my binary. A billion people, or some
percentage of a billion people are going to be slam
bam, you get this binary. If I have made a
mistake, or if there’s a fatal, or something
silly in the app, it’s gone. That bullet has left the barrel. And I’m screwed. There is a little bit
of hope on Android in that people who do allow
automatic updating– and this is a huge great
feature for Android– is I can say, update to 5%. We did a push yesterday. We released our Android app. And how we did it–
we said go out to 5%. And we get data. And we say it looks
good, ramp it up. But that’s only people
who opt in and go through the nightmare
of checking off all these boxes that are
buried in various places in the Android operating system. We have to ask permission
to do something. So again, if I want a hotfix, I
have a crisis, the first thing you want to do as
release engineers, you fix that problem, right? You don’t do that in mobile. You make a nice package. If it’s Android, you have
hope that you can get it into the store if the stupid
thing will upload correctly. I’ll get into that. And then if it does get
there, you can get it out. But then, even if you
got it out there quickly, what are you going to do? You just got to wait for people
to click the stupid button. And Apple, it’s even
worse, because if you happen to have your hotfix
in the middle of Worldwide Developer Conference, when
the intern takes your app and puts it on a USB stick and
takes it somewhere to do that. If it takes them
three weeks to do that, your hotfix will
sit for three weeks. Anyone work at Apple here? Good. I can keep talking. So as release engineers, these
are serious problems for us. And you have to keep this
in mind, because now, with continuous delivery,
you don’t sweat these things. But now, it’s the opposite. Like I said, you threw
that all away now. And now, you have
these problems that are a real nightmare for you. Some idiot shipped the
wrong icon for the iOS app. I did that. So there’s no worse feeling
knowing you did something globally that’s in the news,
because of a stupid icon. And there’s nothing you
can do to get that back. So keep that lesson in mind. Permanence– all those
little bullets that you fired are still out there. This is a little slice of
what people are running. This is telemetry
from our Android apps. Those are the
versions of Android running in production on phones. What do I want
everyone to be running? I want them up in that
green section there, in the upper right. What are they running? A vertical slice of crap–
of 20 versions of old stuff I don’t want them to run. My mom is somewhere in that
red line at the bottom there. And so they complain the
experience is terrible. Of course the
experience is terrible. You’re on a version
literally 16 releases ago. So this, again, is going
to be your reality. Testing these things– so I’ll
talk a little bit about that but especially on
Android, you have this. This is a heat map of the
devices sending telemetry back for our Android app. There is a long tail of
crappy little Android devices that will never die. Technically, your app needs
to be tested and run on all these physical
hardware things. And again, out of your
control and something you need to consider. I’m not even giving
you the vector of which version
of Android they’re running on these phones– Froyo,
Gingerbread, KitKat, Jelly Bean, ICS– all those, we could
put another matrix in there, and your head would explode. And you know darn well that
something that works in ICS is not going to do
well on Gingerbread and a million other
permutations of that. So this is something else
you need to worry about. It’s nicer in the
iOS environment, because it is a bit more
constrained with devices and whatnot. But supporting the iPhone 4
is not as easy as you’d think. So we do need to worry about
how this works on older and different iPad
and iPhone devices. So how do we ship this code? So what’s the process by
which we’re getting this out? So like I said, the
web is well known here. Just one thing on
organization– this is a big thing we
could talk a lot about. But the normal thing you do
is you have your normal web deployment world and your
development environment. And your engineers have
your desktop web guys. You got your product experts
for the product itself, and then the mobile guys tend
to be platform experts, right? So they’re shoehorning stuff in
because they know the platform. But they don’t know
messages, they don’t know photos,
they don’t know– whatever functionality
they’re working on, product feature, they
might not be the expert. We started out this way because
it’s naturally what happens, right? What you’ve got to get
to is obviously this. So we have no more mobile group
and web group and all that. If you work on the chat group,
you work on the chat group for all platforms–
web, mobile, whatever. And there was a bit of
organizational noise that went and shuffled around. But it was worth it in the
end, because when it settled, we had this. And we had less cloudiness
on the mobile side. And our mobile quality
improved greatly because the features were
done by the people who know the features, who know
what they’re trying to do,
regardless of platform. When we did that, the
number of developers we were supporting
on mobile kicked up. And this is an actual
graph of the number of unique individuals
checking code into the mobile code bases. And after the re-org, bam. So as release
engineers, we have just multiplied the volume of
stuff that we’re dealing with. Just be aware that when
you go to this model, that’s what you get. So this fixed-date release
process– Facebook uses it, Chrome, a bunch of other
people use this process. It’s not ideal. You all know trying to
get software engineers to hit a date is like
trying to give a cat a bath. It’s just not– it’s just
fighting the whole time, and they never hit the date. So while we don’t love it, this
really works well for mobile. And we’re trying to do
things to optimize that. When you have a
date-based release system, what are you trying to ship? You have three
things you’re trying to worry about when
you’re shipping software. You have the features,
the quality of the code, and the schedule that
you got to worry about because it’s a date-based system. When you’re under this
kind of constraint, you got to pick two of these. Which two do you pick? You pick quality and schedule. Those are the two things,
as release engineers, that we focus on. Have we regressed anything, and
are we going to hit the date? If it’s a feature issue,
it’s not the priority. Why? Because we ship on time. And again, this is where
we have the most conflict with engineering
with developers, because they’re crappy at
hitting time, hitting dates. The good news for
this is you don’t have to wait if you
do get your stuff in. So if you do have a press
release, or a major feature announcement, or whatever
it is you’ve got to get out the door, you know it’s going
to go out that day, because we’re going
to kick out anyone who doesn’t fit the bill. So the good news is if you
do have your act together, and you do get in, you
will be in good shape. Your stuff will go out. We’re like doctors– do no harm. So if we do something, we
cannot make things worse. And that’s the sum
of the criteria we judge every commit by. Are we making things better,
or is this just an iffy thing that will possibly
make things worse? To engineers, four weeks
seems like a lifetime. To PMs, four weeks
seems like a lifetime. Four weeks is not that long. And you know there’s
another one coming. Those trains are always leaving. Do not freak out when
we throw your thing out. I had a team come to me at week
three of a four-week cycle. And they’re like, this is big. We got to get this in. It was literally 15 changes
to the main photo flow in the mobile app. And we’re like, no,
get that out of there. We’re not taking this
at this late date. We’re almost ready to ship. You’re nuts. No, no it’s a high priority. Zuck wants it. It’s got to go in. And we escalate. Boom, boom, boom, boom, boom. All right, let’s
get someone more important than you and
me to talk about this. And I eventually
get the thing thrown out. So we ship. We’re good. The next cycle comes up. I go to that team
about two weeks in. I’m like, hey, you
guys get that stuff in? We’re going to check it out. They’re like, nah, we’re going
to wait for the next one. So it was like,
you’re killing me. You wanted to get in. It was the most important thing. You’re going to wait another
cycle because it wasn’t ready. So as a release
engineer, you have to have that sense
of– these guys are not going to land this. And you’ve got to assert
yourself and say, like listen, there’s the next train. You’re on it. Get off my train now. Things that break aren’t ready. Get them out. Don’t waste time fixing
forward or taking more patches on top of more patches. Like, OK, I know I
gave you those three diffs and those
three cherry-picks, but take these three more. It’ll fix it. I promise. Use your judgment. But literally, do not
let them walk over and keep dumping in
and fixing forward. Just like, no you’re done. You’re on the next train. Get out of here. I got more stuff to worry about. You’re just annoying me now. Put your mean man face on
from slide number two there. OK, let’s talk a bit
about the mechanics. This is our web
development cycle. So we have our source
control system. We use Mercurial, Subversion,
Git, I don’t care what we use. I hate them all equally,
so it’s not a big deal. You’ve seen developers screw
up source control many ways dealing with branches
and complex things. I think a couple guys
from VMware are here. I set up the system at
VMware back in the day. That was a hard problem–
many long-lived branches with dot releases
of many products. We had a really good
system under Perforce, that when you check
in, it asked you where you wanted
your stuff delivered. It would deliver it,
check in, build it, let you know it went
in, blah blah blah. It could not be
simpler at Facebook. No matter which crappy source
control system you use, you check in to
master, you’re done. OK, that’s all they have
to do as developers– get fricking code into master. What we do in web is after
a week of development, generally it’s
Sunday at 6:00 pm, we cut a release branch,
a simple release branch. From Sunday to Tuesday,
during that blue period there in that blue
box, we stabilize. We test it internally. We make sure it’s good. If everything’s good, Tuesday at
around 4:00 pm, that goes out. That is between 4,000
to 6,000 changes that went into trunk that week. For the rest of the week,
that’s my twice daily push, during that green box there. And that’s where I take my
30 to 300 cherry-picks a day. That flow has been
the way at Facebook for six years– has not changed. The big win for this is that we
ship twice a day, we’re fast. That little blue box in there
is like internally, we’re dogfooding before
anyone sees it. Again, we’re not waiting for
anyone, because at that rate, it’s like, if you don’t hit
today, there’s tomorrow. It’s like hours away. It’s not the end of the
world if you don’t make it. The engineers are there
supporting their changes. It’s true dev/ops. Your change doesn’t go
out unless you are there. I won’t push your web change
unless you show up and show me you’re still alive. And there’s clear rules. We all know it. There’s an on-boarding where
I brainwash all the new hires. Like, this is how we do it. This is what you’re doing. We’re all on the same page. So let’s take the desktop web. Let’s overlay now
what we do on mobile. Not very different, the
time scale’s changed. So now we have four weeks
of development in master. And at the end of four weeks, we
cut our simple release branch. And that release branch
lives for 3 and 1/2 weeks under our eyes. And that’s where we take more
cherry-picks, probably between, I want to say 120,
150 more cherry-picks will come in to stabilize it
over those three and a half weeks. And then that green
period of soaking– like, don’t change anything for
three days internally. Just dogfood it for three days. Let it accumulate state. See what breaks, and
see if we’re good. At that point, that
green line is our fourth week. The day of our fourth
week– exactly four weeks from that first red line,
it goes out the door. What I want to keep at
Facebook, is no matter what group you’re in,
front end, back end, mobile– this little
picture is your life. And for the most part,
this is true at Facebook. No matter what
group you’re with, you will release with a simple
branch cherry-pick system. The time scale will change,
and some of the mechanics will change. But I’m a big advocate for this. We can’t do true continuous
where we can just deploy from trunk, but I
like this little buffer zone of having the cherry-pick
system with a release branch. It’s worked very well. So we haven’t changed things. Like I said, if you
changed from web to mobile, you have the same thing,
except now some of the times have changed. Otherwise, all the
same things you’ve learned, all the operational
awareness you’ve built up as a developer is
still with you. On Android, we have
one special tool. And God bless Google and
Android for doing this– is the alpha and beta program. So I want more eyes on my stuff. And I do this with facebook.com. The website, like I said,
I can leak out stuff to 2% of the user
base at any time to get feedback
of what I’m doing. But I have the beta
program on Android. My beta program is
a few million people who volunteered to get the beta. The beta comes
from the blue line, which is the release branch. I have more beta customers
than most people have users. Every Monday,
Wednesday, Friday, I ship whatever’s in the
branch to these people. Obviously, auto update’s
an important thing for these people. They all have it turned on. So bam, bam, bam. I get that. And I get telemetry
back immediately. So I can now
analyze what they’re seeing, what crash rates,
what the logging looks like, what bugs they’re reporting–
all this good stuff. We were really happy
with the beta program, and then Google announced
the alpha program. Wrap your head around
this– I am shipping trunk to a few hundred thousand
people every night. That should scare you. So the Android app– if
you’re in the alpha program, you will get pushed every
night whatever’s in trunk. There’s some certain things we
do to ensure that nothing leaks and that we’re in good shape. And I’ll talk about
some of safeguards. But that’s really cool. All right, let’s get
in to the developments. What are we doing in
that 3 and 1/2 weeks? Because we take that
full time to figure out if this thing’s going to ship or
not, or if we’re in good shape or not. So let’s talk about some
of the details there. These are more the philosophy
in that release branch. How do we keep that release
branch in good shape? The biggest thing, like
I said, is no features. If you take a feature, you’re
basically resetting the clock. We’ve done all this testing. We’ve had all the dogfooding. Everything that comes in is
resetting and destabilizing what we’ve done. And honestly, if it
didn’t make that cut, we’re assuming it wasn’t
ready when the cut came and you’re not going
to cram it in later. You can’t just worry
about native code. And it depends on your
app, but if any of you have any kind of app that does
anything of significance, you’re going to
have these issues. We are very picky about design. And there’s a design team
that in mobile, you just can’t throw in an
element, or something, or change the UI without a
pretty heavyweight analysis of is this the way
that we want to go. Did we get logging data? Is your logging in there? Are we getting data
from dogfooding that your thing is working
and turned on and good? Are there server-side
endpoints, or updates to the website that
need to roll out before your thing
can be turned on? So make sure we coordinate that. The worst thing in the world
is pushing something out, where it starts hammering the
backend because they didn’t realize the use was
going to be like, 10x what they thought it was,
or the endpoint’s not there, or it’s not at the right level
to make the right response. We have big privacy and
legal issues, as we all do. There is a team for a release
that looks at what’s going in, and says yes this is good. If you rush it, or you
take in late changes, you put that at risk
because you could derail what they looked
at when they first said, OK, this is what’s going
on for this release. And yeah, basically if you
are not testing in master, we want that– we want it
vetted in master before it gets in the release branch. So if you’re putting it
immediately into the release branch as soon as
you check it in, we don’t have that
window to do our test to make sure that– we
still run the tests, but I want it going
through master to release for more sanity. You are guilty until
proven innocent is pretty much our motto here. So every time you do
ask for something, we need to approve it. And we use a cherry-pick
system, again, this is across all of Facebook. It’s called Releeph. It’s part of Phabricator,
which is our whole code review– our whole stack is
open sourced under the guise of Phacility, which
is a company that has all our internal tools. They’ve open sourced it
and they maintain it. Within Phabricator is
this thing called Releeph. So if your diff is
accepted– and this is a diff that was
accepted and is in master– that link shows up, and
it says Releeph Request. And what you’re saying
is I want this to go out. I got it into master, but I
still want it in the release. And you’re in week
two, or three, or whatever of the process. You click that
button, and out comes this page, where you tell
me why we’re taking this. And on the right
there, you’re going to say, yeah, we’re
taking this because this is a really bad bug with
display model blah blah, blah. Boom. My release engineers are
going to look at that, and say, yeah this
is legitimate, or this smells fishy. The other thing we have is over
here, we have these two lines. The top line says Size. It’s the size of the diff. Is that diff a big
diff– number of changes, number of lines
added, deleted, moved, whatever– that bar will grow. The bottom line is Churn. And that’s the
amount of discussion there was in the diff. How’s the diff go? You send out your diff,
hey, here’s my diff. And you go, your diff
sucks and so do you. They’re like, no, you suck
and so does your mom– and back and forth
and back and forth. Maybe those are just my diffs. So that will get
bigger, and bigger, and bigger as there’s
more rejections, changes, to that diff. So if I see big bars there, I
know there’s some contention, and I want to take a look. The other thing which I’ve
blacked out is the Karma. And right under there are stars. And all engineers start
out with four stars. You can only see your own Karma. But if there is a bad thing
that happens, I push this. And I’m the only
person on Facebook who has this symbol, which
is the Dislike button. So if I push this, it
means we got a problem. And a box opens
up, I type in what happened, how we can
improve ourselves. I click Submit. It goes to me. It goes to you. It goes to your manager. And it goes to our work.com
performance review tool. So it’s very much
a public shaming, not a private– I’m sorry. It’s very much a private
shaming, not a public shaming. You really want to
avoid public shaming. I do this in my head. You cannot stop me
from doing this. OK. When I was at Google, I
could walk down the hall and say like, two stars, three
stars, four stars, two stars, one star, zero star. Right. Where’s Boris? There. So you know you do this
as release engineers. And it’s your job. You’ve got to manage risk. And when you see a
room full of engineers, that’s a room full of risk. So but when it got to
300, 600, 800 engineers, I couldn’t do it anymore. So we made this system. Now, this is not like
some punishment system. Nobody’s ever gotten fired
because they got two stars. But it will be really
valuable to you as you get to
people on your team, and as you start dealing
with different teams to remember– you
look at this diff like why does that
guy have two stars? Oh, yeah. I remember now. So you’re going to take a
little bit more extra care and see what’s going on there. And that is part of Releeph. You can download this
and use that as you wish. Again, we are risk averse
on the release branch. We need a reason to approve,
not a reason to reject. And then, like I
said, the main reasons here are there’s pluses and
minuses for every change. You have to let your release
engineers have this flexibility to be able to make a judgment
call, very subjective, to say like, this feels right,
this does not feel right. And there can be some criteria
that you can spell out for it– help them with
that, or help publicize it. But we’ve been doing this
as release engineers for six years at Facebook, and
the push back is very low. We are very well-respected
as release engineers. And make sure, in
your organization, that you are respected for
what you do– for your judgment and for your skill set. And given that, when release
engineering says it’s not good, it doesn’t go. So make sure your management,
your organization, your culture at your company
backs you on this. Finally, there’s a great
quote in the movie “Ronin.” Robert De Niro says,
“When there is any doubt, there is no doubt.” And that’s our motto. If we get queasy on anything
in that release branch, and we don’t feel
right in mobile land, it goes out– an
important thing. Let’s talk a little
bit about the tools. Let me check my time. So the tools– I need to
get master in better shape, especially because
on Android, I’m shipping the thing every night. So I need some tools that when
the developers land their code into trunk, they have
confidence that they’re not going to burn themselves. Because they’re
very attentive now. I think the developers have a
very good operation awareness. They want to do right. They want to be happy. They want to land that code. They don’t want
to cause trouble. They don’t want to
lose that precious star and their Push Karma, so
they will pay attention. Give them the tools
to do it for them. This is our continuous
integration stack on mobile. We use Buildbot. You can use whatever you want. Buildbot works for us. But basically,
you have the build part, which is we
build everything. When you check in the mobile–
what is Facebook mobile? It’s not just one app, right? It’s the native app for the
platform, so iOS or Android. It’s messenger. It’s pages manager. It’s Instagram. It’s a bunch of other stuff
that’s going to launch or has launched. It’s new projects. There’s a long list of those
little boxes of squares on top. For every commit, you might
break something across the way. We use a monolithic
code base, much like Google and other places. So you really need to
check all the builds when you check something. I just fixed something
in Facebook for Android, but you just broke messenger. So we do all those builds. On the way in,
there’s a whole series of lint/static analysis–
things that check are our policies being followed? The easiest one
is the regex one. Anyone can write a regex to
say if you should do this, or you shouldn’t do that, the
regex will catch it, throw it as a warning, throw it as a
fatal so they can’t check in. That’s simple. And anyone can contribute
to it with a simple regex. For more serious
stuff, we use Clang, which does the static
analysis and checks memory, or dead code,
or things like that. Android has some built in
linting that we use on that. This is both platforms,
iOS and Android together, so some go with some. Some go with the other. But you get the idea, right? So you have that layer
protecting your master code base. Nothing gets in unless it gets
through that red section there. And finally, the tests–
for each platform, there are various test
systems that go in. We also do WebDriver
from our UK offices that does also end-to-end
integration style testing as well. So that stack happens
all the time– all of it. How often? So this often. So during each
step of the process that whole stack is
run– every build, every test, everything done. If you’re at Google
or Facebook, this does not impress you,
because basically, I say this kind of boldly. But there are no issues with
compute power or storage. Those are infinite
as far as we care. You need to get to that. Machine resources should
not be the thing keeping you from running this full stack
while the person’s developing the diff, when they create
the diff to send it out to the other developers, when
they update it after getting feedback, when they land it in
the landing queue to check it out before it gets delivered,
and when it gets committed. Each step is going to
run through that stack. And machine resources
should not be the reason you can’t do this. This is the number of
builds we’re doing per day. It averages around
20,000 to 30,000 builds a day to give you the
scale for our couple dozen mobile apps. So this Async, when
it’s built and tested, what’s it look like? So when you do commit,
the reviewer– actually, this isn’t a commit. This is for a diff. The reviewer will see did
the stuff pass all the tests? Did it pass all the builds? That’s in the diff itself. So the diff tool itself will
expose any dirty laundry that you have that
didn’t pass test, didn’t pass builds– will be
there for the reviewer to see. They see that box is red, it’s
an immediate go back, hey, go check that out
before I check it out. Shrubbery is the thing we put
on top of our build system. And this is basically showing
us across the whole matrix of builds that are running
where did they fail. So if you go there and
look, you can see exactly. Like, down there, that
red bar says, oh yeah, a test failed here. You can click through
and land at the test console to understand like,
OK, where did this go wrong? And the reviewer
will do that as well. Dogfooding– we all know
the value of dogfooding. But dogfooding on mobile
is a different problem. Within Facebook, I force- if
you’re on the Facebook network, or within the VPN, when
you go to facebook.com, you’re never going
to facebook.com. We always redirect
you to what we’re going to ship– our dogfood. How do I do that on mobile? Well, I have a mobile builds
page that people will go to on their phone, and they can
download any version of master, or the release candidate,
or a previous release of the various products. And this page
scrolls down a bunch. There’s one other thing though. People are lazy on mobile. They actually use
their mobile phones. They don’t want to be bothered. On Android, I force
them into the dogfood. If you’re a Facebook
employee, you will now download the Google
Play version of our app. It will always kick back
and download and use our dogfood version. On iOS, it’s a harder
problem because we don’t have the guts to do that. We wrote a wrapper around
the app for internal use. So the problem with iOS
is you’d fire it up. And you’re in the
middle of the park, or Burger King, you want to
check in at Facebook– hey, I’m having a burger. Boom. And it’s going to come back and
say, hey, you’re on the wrong build. Upgrade now. You’re like, I’m in the
middle of Wisconsin. I don’t have connectivity. I don’t want to do it now. So it’s a big pain. What we did is wrote a wrapper
that in the background, it knows when the
new package is out. It downloads it, so when
you fire up the app on iOS, it’s going to just tell
you hey, by the way, I got the new app already
waiting for you here on the phone. Just click install. And that is going to boost
a lot of the dogfood usage for our internal people. So it’ll be very seamless
for them to keep up to the latest, because that
changes every day, right? We’re going to ship that
dogfood app every day. The test console–
if we click through in some of those
test failures, you’re going to see basically
the history of what’s passing, what’s failing, and
specifically what failed. So you’ll be notified when
this stuff fails for you. You’re going to go
look for your commit. You see your commit. You click through. You’re going to get
exactly which test failed. Our tests have an
automatic quality rating. So if we see tests
are failing or flaky, the star rating for tests–
tests have Karma as well. They will lose their Karma. And eventually they’ll be
discarded if it’s a flaky test. Click through here. Again, these are tools
that are kind of specific, but you can basically
see the history of how this thing failed,
exactly when it failed. Point being, you need to
have data that gets you down to the rev level of
when things went wrong. So again, every commit,
every test run every time, it’s easy to
basically bisect down. And say, here’s the
point which we failed. Here you go. With all this, we
still have breaks. And if you’re
committing and something breaks but it’s not
yours, you can always pull back to a stable point. So there’s a
rolling stable label in master, where if
you’re hopelessly broke at top of trunk, you
could say, listen, give me back
something I can build. You can always pull
that stable label back. And that automatically
updates as things pass. That’s a simple thing. We’ve been doing that for years. I think all of us
have done that. Don’t forget it on mobile. Test failure bot–
again, nobody likes tests because they can be noisy
and brittle and give you noise. The bot tries to take
care of a lot of this. It’ll assign bugs it
sees that are unassigned by doing some analysis,
and says, nobody owns this. But I think this
guy should own it. Or if it sees things
that have been closed, or that the test is
working now, it’ll go off and close the
test, saying like OK, this clearly works– no
reason to keep this bug open. So get the bots to do a lot of
the crappy work of figuring out where tests should go, or
if it’s open or closed. Because that’s the part that
developers really fall down on is just responding to
this endless stream of noise about tests failing
or passing, and failing and passing. So mitigate that noise
to them as best you can. With all that, especially
with Mercurial and Git, when you rebase, you
could get in a bad state. We use this thing
called Landcastle. And what this means is
when you commit into master while that stuff’s
all running, we worry about rebasing your
stuff into the latest– basically, the tip of the tree. So when you commit, this
thing’s going to say, yes I’ve queued up your thing. Your change is in there. It’s going through the stuff. I’ll keep rebasing it
for you until it lands. If in the process of
rebasing it basically finds a problem or a conflict
because stuff is coming in, it’ll send you a page. And say, listen I can’t land
your change in the master because this guy just pulled
the rug out from underneath you. So go deal with him. So again, taking the
onus off the developer to worry about constantly
rebasing, constantly checking if that
thing gets in, it’s much simpler if the system
takes care of that for you. I think we
had a similar system at VMware. With all this, we can
still break master. What do we do? We have the Sheriff. I do the same thing on web. Like I said, there’s two
people essentially running web. There’s 1,000 developers
against two release engineers, and we release in real-time. So we have on-calls, or Sheriff. I have a page with a
rotation for every group– all their on-calls for that week
for all the different groups. When things go bad,
and photos doesn’t work on the new iOS build,
am I going to debug that? Not so much. I’m going to go to the
list, and it’s going to say, this guy right here is the
photos on-call for the week. Here’s the problem. Fix it now. Fix it real-time,
and get back to me. Get who you need. Back out what you need. Do something. And that’s the job
of the Sheriff. This is ideal because
it gives the developers this operational burden that
really opens their eyes. They become allies of you. They feel they’re part of
the release engineering team. We’re part of the gang. And as you spread that–
as more people get to be Sheriff and on-calls,
they have this empathy for like, yeah, this sucks. We’re really screwing you guys. So they will be better
engineers and better operational people if they have this role. Their main role on mobile–
get it working, man. Just revert. Just get me back. Get me back on my feet. They’ll look through. I’ll send them this link like,
hey these tests are failing. I can’t figure out
what’s going on. It looks like it’s
related to photos. They’ll go in, and they have the
special super confidential tag, where they can get
stuff in, bypassing some of the big stack of stuff,
because I want that fixed now. So as the Sheriff,
you get checked off in the database as a Sheriff. You get a special tag to commit. And you get your stuff in. Generally, reverts
go in immediately. The bisect tool is cool. Basically, you can do a
live bisect on your phone, trying to find a problem. So you can go here. Tell us what bug
you’re looking for, basically punch
in build numbers. And they’ll just suck down
from the dogfood page, so you can try
different builds, bisect until you find that exact point
that things have gone bad. All right. Let me wrap up here. So the big shock for us at
mobile was we thought we had things solved as release
engineers, and we didn’t. The process for us at
mobile was I went out and I hired Christian Legnitto,
in the back, from Mozilla. And I was busy with
the other guys. Just getting web–
was pretty good. I wanted to make sure
web stayed on its feet. I said, Christian, that
mobile thing’s a mess. Go deal with that. I threw him in– like one guy
into this big den of mobile. And he did a great job. But he had to come
back, and we all had to figure out we have
to change how we develop, how we ship, how we write code. All those things that
we had solved already had to be rethought. And the tools had to
be modified a bit. And new tools had
to be invented. But the important
thing is it was not a shock for any developer
going from the web world to the jarring cold
reality of mobile, because the culture
was the same. They all knew this
DevOps culture. They all knew they had
the responsibility. It was very bearable. So I can say, quite confidently
now, Facebook is mobile. We are a big heavyweight
mobile company. We have many, many mobile apps. The team in the
back and myself are responsible for shipping
those mobile apps. We’d love to hear what other
people are doing with mobile. I think we have a lot to learn. We have a lot to share. We have to really lean on our
friends at Apple and at Google to help. I don’t like to be critical,
but they’re keeping us back. The systems have not
kept up with the reality of the mobile ecosystem. I promised I’d complain
about the Google Play Store. We have 17– when we want to
release one version of Facebook for Android,
there’s 17 packages. Because we build out APKs
for individual DPIs or chips, or whatever. It’s no fun to go
into a web interface and uploading 17 packages every
four weeks with release notes. So this is silly. I mean, let’s not
be amateurs here. Let’s get an API. Let’s get this thing
like industrial strength, so I can get things through. I promise you, I am not happy
with four weeks releasing stuff. I’m going to get to
two weeks, hopefully by the end of the half. All right. I want to ship both platforms
every two weeks, eventually every week shipping mobile. I can do that, but the
tools at Apple and Google are, right now, one of
my biggest hurdles– to get that cadence. All right. There’s my contact information. I think we have
time for questions, if we have maybe a microphone. BORIS DEBIC: Thank you, Chuck. We have time for
a few questions. And we’ll ask you
to speak to the mic, so we get the
questions on camera. AUDIENCE: So the
Karma stuff– so A, Craig from Wikimedia
Foundation, we’re actually seriously considering
moving to Phabricator right now. We’re in a mix of
Gerrit and Bugzilla and it is hell– on
Trello and Mingle and all the other
crap that’s out there. But so, Phabricator, we’ve
been talking with Evan a lot, and we’re thinking about–
oh, is it– oh, there we go. Swallow the mic. All right. So right, so Wikimedia
Foundation– we’re thinking about
moving to Phabricator. And one of the things that I
liked about the features that you mentioned
that I didn’t know about was the Karma thing, because
I’m the release manager there and I have those same
stars in my head. So the very basic question–
who all– so you said you have that right to dislike. Is there anyone else
like your release team? CHUCK ROSSI: Right,
so the question is– the Karma thing
can be sensitive. And I don’t mean it
as a mean-spirited thing. You can’t use it as a club. But it could be a subtle
way that you can keep track. And the question is
who has access to that? Well, all my release engineers. So everyone in
release engineering is in the database as being
able to see that– just the user themselves and
the release engineer. So it is, again,
a private thing. And I’ll give you an example
of how it was used for good. We had a guy in web who
was just killing us. Like, every time he’d touch
code, [INAUDIBLE] would break, or something break. Something in
platform might break. And he was down to two stars. And our policy is when
you’re down to two stars, we just don’t take your change. Because clearly we’ve– you
only lose half a star each time. So that’s four times. And we only give
you a down Karma if you really– things
had to stop working. So we’re like,
this is ridiculous. This guy, we can’t
take his change. And what’s going on? So we were able to get
with him and his manager. We’re like, what’s going on? Well, it turns out, he’d
inherited this awful JavaScript code from 1,000 years
ago– probably written by Zuck himself– that
landed in his lap. And every time he touched it, it
was just a hopeless situation. He was just doomed. So we said, OK, we got to step
back, get proper resources on this, revamp what
we’re doing here, get some real– you
can’t go on this way. So it really
flushed out an issue that was not this
person’s fault. But something that wasn’t
getting attention clearly showed up on the Karma
scores that helped. AUDIENCE: I’m Ryan
from Cloud Foundry, and one of the things that
really stood out to me was when you said that
we as release engineers need to make sure our
team is well-respected in the organization. So what are maybe one or two
of the most important things we could do to ensure that
when we say something important it is heard at the
highest levels? CHUCK ROSSI: Right. So the first thing is you’ve
got to have the attitude that you’re not there
to hinder things. The push back you get is like,
if you guys get in control, everything is going to
stop, because you’re not going to let anything out. And you’re going to be
grumpy and like that. So you have to balance
the idea of like, I’m going to be
cautious, but I really want to make things happen. The team wants to enable the
company and the developers to get stuff going. But we’re going to
be– like I said, some sort of adult supervision. We’re going to do a little
bit of sanity test for that. As far as how you can
build that, it’s tough. You absolutely need an advocate
up the management chain. And I’ve been lucky here
at Google and at Facebook and pretty much everywhere that
the organization from usually the VP of engineering down,
understands the value of what we do– of release
engineering and learns to trust the experience and
the judgment of these people. So I would really advocate
up your chain a bit and see if you can get
someone to back you. It helps if you have an
experienced release engineer on the team somewhere. Or even the other
one is– and I’m sure you can find these people. Every group has a developer who
is like a frustrated releng. They’re always
getting in your stuff. They’re always pointing around. They always like to help. They’re giving you great tools,
and they love this stuff. Get that person on your side
to help advocate for you. And those developers
who are on your side are going to be a big boost for
your team and for your respect. The other thing, I’ll
just say one last thing– is do the thing I
said about on-calls. Get developers in
your little world. And say, hey, you’re the
on-call for the week. And you will get
your circle of trust, and you’ll get more respect that
way, because they’ll be like, yeah, you guys do crazy work. I can’t believe it. So those things will help. AUDIENCE: So one line that I use
that was very effective to get people to accept
what I was doing was to say, hey, you’re going to
have to jump through a few more bureaucratic hoops,
but you’re only going to do your work once. You’re not going to
have to go back and deal with this craziness
of things breaking and having to be redone. CHUCK ROSSI:
Ultimately, the pitch is you’re going
to help yourself. If you let me do
these things for you, I’m going to save you
a lot of pain and lot of redoing things down the road. AUDIENCE: Hey Chuck, you had
one of your diagrams ended with a soak period. So what happens if you get a
surprise in the soaking period. Does that affect that cycle? Does it affect the next cycle? What do you do then? CHUCK ROSSI: Yeah,
soak is not ideal. And in fact soak
doesn’t work that well, because if you’re a day into
soak, and you’re like, oh yeah, that doesn’t work. So you got to
cherry-pick, and you’ve got to push that out again. You got to collect edit. It stinks. You know what saves that 100%? The beta program. So I could be in soak. I’m in soak now the whole
stupid time that four weeks, because I have two, three,
whatever million people we have out there using the app. So when I find something,
and I’m in soak. And I cherry-pick, I don’t
have just the 4,000 engineers of Facebook to go give this to. I have a couple million
people I can go give this to and instantly
get some feedback. AUDIENCE: But then why do
you need the soak period? I mean, why have it? CHUCK ROSSI: The
question is why do we need the soak
period at all then? And in fact, it’s
probably less of a thing now that the beta–
I need it on iOS, because there is no
beta program on iOS. Maybe if you can give the
microphone to Christian sitting behind you, who works. Can Christian just
grab the mic there. Christian works on mobile. CHRISTIAN LEGNITTO: Yeah,
so the original goal was– the way we dogfood
and the way beta dogfood is, you’re basically installing
an update every night, which is not the way our
users will actually run it. They’ll install one update
and then run it for a month. So there’s some
sort of bugs like, local caches growing
unbounded, or something like that, where
installing every day would hide those bugs. And of course, we
have to push out a new build every day
because we want to test the changes, the
cherry-picks we’ve taken. But at the end of
the day, we want to test how our users
are going to experience it, and they’re not going to be
installing updates every day. CHUCK ROSSI: So
we did effectively get a soak through that. AUDIENCE: Hi, I’m
Fred from Google. I’m just wondering about
how you use that telemetry. What do you do with the
telemetry you get back, and how do you interpret it? CHUCK ROSSI: So the
telemetry coming back from those various channels
will go into our graphing system, our data collection system. We’ll basically graph it over
the current production values– so crash rates, TTI, app size,
bug rates, all those meta values will be
transposed over the known values for production. So if they vary, and we’re
all very intimate with what those numbers are,
and we can see. The other thing we’re getting
is individual results– specific logging data that
is only from those people. We do this on the web. It’s fantastic. We have a page for all
the log data coming back. It’s like, show
me only the stuff happening with the
new beta release. So it’ll flush out
like, hey, these are new errors I’ve
never seen before. So instantly like, oh,
that’s all new stuff. We’ve got to go flush through that
and see what’s going on. So those are the two main
ways we can get that telemetry and figure out if we’re in
better shape or worse shape. AUDIENCE: Hello. My name is Armand from Mozilla. It was excellent
what you showed us. There’s one question I have
with regards to the home page for each team
of developers. If you’re backing out up
to the last stable state, why would you need a team of
sheriffs or on-call people? CHUCK ROSSI: Right. You’re specifically like,
if master has trouble, why do you need a sheriff
to back it out, or– AUDIENCE: You back out. You get to a stable state. And why would you need
a list of people on call if you are back to
stable supposedly? CHUCK ROSSI: Right. Only one
part of the job of the Sheriff is to deal with
breakage in trunk. It’s almost an easy
part of the job, because you can just revert. And a bot could almost
do that at some point if you get some
confidence in it. The real reason I need those
on-calls and sheriffs is operationally– we
operate in real time. The data coming in
from production, the website going out
every day– something is happening all the time. If something goes wrong,
if I look at my graph like Fred was talking about,
and I see the production graph for crashes just all
the sudden, bam– goes up, and I look at the stacks. And I’m saying, why are group
messages suddenly fataling? I’m going to go right
to that on-call page, find the groups person
and say, you– this is your life right now. You got to take
this, and you’ve got to figure out right
now what’s going on. Now you guys at Google
championed the use of SREs to do this as the first line to
figure out mostly for web side, but when things go
wrong, SREs responded. We’re less about that
and more about having it go right to the team. The operational people are
within the team itself. So that’s the real reason. I need their expertise
as on-calls and sheriffs to be able to look
at a problem– a stack trace, a code path. They know what
went in that week. They’re like, oh
shit yeah, we just updated that groups– that
push went out yesterday. And all the sudden,
it’s failing now. Somebody changed a
gatekeeper, or somebody turned on that code. And now we’ve got to react. And he’s the best person to do
it, because he’s in that group, and can find– they’re
like, it was that guy. And then run down, you get
them, and then we do it. So that’s the big win. AUDIENCE: Do you
have any challenges with the balance of power
between product owners and testing and then
the release management? Do you ever get overridden by
your decision– get overridden by the product owner
saying, that testing, it’s not relevant. The feature being delivered is
more important than that bug. Ship it anyway. CHUCK ROSSI: Yeah, we probably
don’t have it as much. But that’s the natural
order of things, right? So we don’t have any QA groups. So there is no QA. That data is evident by itself. Tests pass, or they don’t. So that pretty much
settles that argument. The last is the release
engineer versus development, or more likely PMs. Because PMs– I make the
picture of on the one side you have my team,
release engineering, trying to hold down the fort. On this side, you got Zuck
and the management staff wanting stuff to go. And in the middle you
have the developers. And we’re like the rocks
that crush the blood that makes the stuff go,
right– the lubrication. So the poor
developer and the PMs are these little things
between these two rocks that are just grinding away. So yeah, you’re going to have
issues where– especially for the PM, because they
have the pressure like, you got to shift this thing x. And they’re like, it’s ready I
got to try to cram all of this past the release engineers. They’re going to kill me. And they’re like, if you don’t,
Zuck’s going to kill you. So what do you do? Those, you have to
resolve, I swear to god, on a case by case basis. And again, it comes down to a
subjective, judgmental thing. You look at who’s involved. You look at the code. You look at the risk. You look at the benefit. And in our case, you
think, how many users are we going to
mess up with this? All right. And you make a decision. And if you still can’t
pass, you go up the line until again, I give
a lot of credit to our executive team
with Mike [? Schrepp. ?] If it goes to him,
he’ll make the call. And if he says it– it goes. It’ll go. I’ll do it. So generally, you alleviate
a lot of this grief with a faster release cycle. So with four weeks,
we’re just at the cusp where people freak out. They can wait. But they’re like, if
they hit it really wrong, it could be weeks if they just
miss by the time they get in. So that’s why I want to
get to that two-week cycle. All that pressure,
all that conflict, all the grinding of gears
and rocks goes away. I can say, relax. You can go out the next
day, the next week. No big deal. We never have these problems on
web if you can push every day. AUDIENCE: You had a
really nice diagram about how you integrate into
your continuous integration like, testing, statistical
analysis, all these kind of things in every state
of development cycle. You’re doing it before the
commit, if I understand, before the devs are
being evaluated and such. So if you do have it in place,
why does the question of Karma come into existence? Because in theory– I
mean, the reason we specifically design these steps
and integrate them into our cycle is to avoid
these breakages, or whatever. So if you can talk a little bit. I’m interested in theory–
how much we can actually prove in that process,
and what is the filter? I mean, what gets through? And why do you think
it gets through? CHUCK ROSSI: Right. So I don’t want to paint
too rosy a picture here. All that– the
slide that I put was very pretty with all the
steps that go through. Obviously– and
we all know this, because we’ve all written tools
and have those things in place. They’re not going
to catch everything. Human judgment is going to
enter into the equation. The thing that makes
Karma come into play is we’re operating in
near real-time here. So when things have to
happen, and those cherry-picks have to come in for
production launch, that’s where the judgment
counts, both for the developer and for the release engineer. The tests I have will really
flush out the obvious things. But if someone checks in
a new flow for messaging, or a new flow for
group update, you’re not going to catch
that with tests. It’s going to
pass, but when it’s all together,
integrated into the app, and you decide to take that
in week three of the release process, your tools
aren’t going to catch it. That’s where Karma comes in. And that’s where you say,
you should have known better. Why did you even think you
could rewrite the messaging stack in week two of a release
cycle that somehow got by us, checked in, and
derailed the release? So it’s more at a higher
level where Karma comes in. AUDIENCE: So the input I
take from that– basically, just in the theory
of [INAUDIBLE], in a situation like that where
we have to update production every week or so, we’re
always going to be behind on the quality of tests we
are putting into our system. So maybe we should
maybe concentrate more on exploring that
issue, as opposed to putting stars on engineers. Because it doesn’t matter
how good the engineer is. If there’s no test, he
won’t be able to pinpoint all the problems, especially
if it’s a huge code base, or a legacy code
base, or whatever. So just in terms of the
[INAUDIBLE] and practices. CHUCK ROSSI: Yeah,
like I said, I don’t want Karma to be used as
a– everyone’s losing stars. I don’t think there’s
anyone left with four stars. When you move like this, things
aren’t going to go right. You’re not born with this
operational awareness that’s going to get you through. What we’re really looking for
is really a neglect thing. You didn’t follow process. You didn’t really
think about this. You took off your
operational hat long enough to make this mess. AUDIENCE: But as
you noted yourself, the neglect thing
you’re able to catch, because these are
obvious issues. CHUCK ROSSI: Not always. AUDIENCE: It’s complicated new
features, which didn’t roll out in your testing framework yet. This is what causes you grief. CHUCK ROSSI: Yeah, and that’s
where– generally, those are caught because they’re big
enough where they’ll escalate, where we’ll try to
discuss that and do the analysis of the big feature. We’ve had it many times. A big feature will
come in late, but it’s really important, can’t
wait the four weeks. The press release
is already set up. And that’s not a Karma event. That’s like– AUDIENCE: Yeah, exactly. That’s what I’m trying to say. If you have a good suite of
tests to catch every change, all the neglect thing is going
to be caught before it even gets into the master. AUDIENCE: Hi, I am
Ramon from Twitter. I had a question
about the desktop web. So one of those scenarios we
hit often is more releases. We want to do more releases,
but our rollback process takes a long time to
roll back a release. And so that was the reason that
we don’t do a lot of releases. So can you talk
a little bit more about how your rollback
process works and how much time it
usually takes to roll back if you already rolled
out to everyone? CHUCK ROSSI: Yeah. Obviously, on mobile, we
can’t have this discussion because that’s firing bullets. And bullets don’t come back. So on web, it’s a
much better situation. And we don’t really
have this issue because– think of
the number of machines it takes to serve facebook.com. It’s a big number. When we push the
button to say go, once we’ve decided we’re going to
go 100%, the binary that is facebook.com is out in about
15 minutes, which I’ve said is both awesome and terrifying. It’s awesome, because
in 15 minutes, the fleet is on new code. It’s terrifying because
if we’ve made a mistake, there’s no time to
pull that red lever. The bus is already
going off the cliff. So what we do,
though, is obviously, we keep the previous
binaries on the fleet. So for us, it’s a simple
matter of, oh, this looks bad. We hit another button. And the send command runs,
and it just symlinks back to the– in our case right now,
it loads the previous byte code and brings up the servers. Now, that is not a pleasant
experience for the end user. It’s a bit like
pulling the red cord. There’s going to be a bump. And there will be
some disconnects. But we will, within 10 minutes,
revert that new code. Now, the thing that
also makes this work is– I make this very
clear to every developer. You will never, ever run in
a homogeneous environment on the front end. All right? So if you make that clear,
you don’t have issues like, well, we’ve already
rolled forward. We can’t roll back because
that new API blah, blah, blah– never happens, cannot happen. With a fleet as big as we have, with
as many backend systems as we have, you cannot guarantee
which version you will ever talk to. You’ve always got to be forward
and backward compatible. If there’s some
issue, there’s always a gatekeeper that can be
turned on or off that turns on or off that new code that’s
sitting out in production that may or may not be there yet. So we’ve really
solved that problem. In fact, if you want to
talk more about that, the guy who helped write that
system, Amir, in the back, is a deployment
expert from Facebook. He can tell you
how that tool works to deal with getting that
code out and back in so fast. AUDIENCE: So I’m John
Oden, formerly Mozilla, and just recently
moved to Hortonworks. So when we were at
Mozilla, we were doing a lot of mobile stuff,
so a lot of these diagrams– I was like, oh yeah. And I want to plus one the thing
you said about a command line programmatic way to be able
to upload apps to the store. That was a recurring pain
in the blank-blank-blank having to do this by hand. The fact that we have to
upload it manually these days– I mean, I’m all for
secure encryption stuff. Is there anyone here who has
sway in either the Google or Apple app store who can
make this happen, please? MALE SPEAKER: We
will make it happen. AUDIENCE: Programmatic,
sure, encryption, signing, whatever, but can we make
it that we can hit a button? And then I would ask for a
version of it, which is also, if we upload something, I know
it’s like, then it’s a bullet and users can start picking it up. If we find out early
there’s a problem, we’ve many times wanted to
go and hit an abort button, take down the thing
we just uploaded. And the only way we
found we could take down was to go find the
previous change set, generate a new build with a
newer number and upload a new one. Whereas, if we
could go say, there was a previous
one already there, we just uploaded a new one, abandon the new one, and let
people see the previous one. Something like that, programmatic. So there are my two wishes. CHUCK ROSSI: Yeah, I
can’t stress that enough. It’s not reasonable. One of my engineers,
Brad, is in the back. And he literally stayed up till
2:00 or 3:00 in the morning just fighting, pushing the
stupid upload button on a web page to upload the number
one app in the world, worth billions of
dollars of revenue, to try to make
this thing get out. And we can’t do that. That’s just silly. BORIS DEBIC: Chuck,
thank you very much for– CHUCK ROSSI: Thank you, Boris. BORIS DEBIC: –the keynote. We’ll have much more
talking during the day. And we’re going to move
on with the program. Bram.
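
The web rollback described in the Q&A above (previous binaries kept on the fleet, and a command that symlinks back to them) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Facebook's actual tooling: the directory layout, the `current` link name, and the functions `activate` and `active_release` are all hypothetical, and the sketch assumes a POSIX filesystem where renaming over an existing symlink is atomic.

```python
import os
import tempfile


def activate(releases_dir: str, release: str, link_name: str = "current") -> None:
    """Point the `current` symlink at `release` (hypothetical layout).

    The new link is created under a temporary name and then renamed over
    the old one, so readers never observe a missing link on POSIX systems.
    """
    target = os.path.join(releases_dir, release)
    if not os.path.isdir(target):
        raise FileNotFoundError(target)
    link = os.path.join(releases_dir, link_name)
    tmp = link + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)  # clean up a leftover from an interrupted switch
    os.symlink(target, tmp)
    os.replace(tmp, link)  # rename() replaces the old symlink in one step


def active_release(releases_dir: str, link_name: str = "current") -> str:
    """Return the name of the release the `current` symlink points at."""
    return os.path.basename(os.readlink(os.path.join(releases_dir, link_name)))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # Both builds stay on disk, which is what makes rollback cheap.
        for name in ("build-41", "build-42"):
            os.mkdir(os.path.join(root, name))
        activate(root, "build-41")  # previous release is live
        activate(root, "build-42")  # push the new release
        activate(root, "build-41")  # "this looks bad": roll back
        print(active_release(root))
```

Because the previous build never leaves the machine, rolling back is only a link switch plus a server restart. That matches the short revert time quoted in the answer, although the restart itself still causes the "bump" the speaker mentions.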
