Model Governance and Explainable AI



I'm Patrick Hall. I'll be moderating the discussion tonight, and that discussion will focus on providing an approachable overview of regulatory and technical concerns around machine learning model explanation and machine learning model disparate impact, or fairness, which is a very hard quantity to define and discuss. We, the panelists and I, will probably focus mostly on credit scoring and credit lending, as that's where most of our experience has been, and you may hear us say things like ECOA and FCRA; those are some of the pertinent regulations. One thing that we are really hoping to do this evening is to help the audience, and maybe even help ourselves, understand how to generalize some of what we feel are the good lessons learned from the regulations and processes in financial services, and to see if those can be made workable in other fields. We plan to save a lot of time for your questions, so we'll do maybe 45 minutes to an hour of questions from me to the panelists and then we'll turn it over to you. We're really lucky to have the panelists that we do have, and I'll let them introduce themselves. Nick, why don't you go ahead?

Sure, thank you, Patrick. My name is Nick Schmidt, and I run the AI practice at BLDS, which is a small consulting firm based out of Philadelphia. Most of my role is to advise regulators and, primarily, lending institutions on issues of explainability and fairness in AI, and in that work we develop algorithms that help minimize discrimination, and we help banks implement those algorithms.

Hi, my name is Bryce Stephens. I'm a new addition to BLDS, joining back in March. Before that I spent seven years at the Consumer Financial Protection Bureau, and I see some alums in the audience. I'm an economist by training, and at the Bureau I worked on analysis in support of the Bureau's supervisory and enforcement work in the fair lending space, as well as on a number of rulemakings and other
policy initiatives related to fair lending and the Home Mortgage Disclosure Act. While I was there, I also worked closely with an internal working group that was thinking through all of these issues related to alternative data, machine learning, and AI, and the implications for compliance and the relevant statutes and regulations that might be triggered by a lot of these emerging FinTech technologies.

Okay, great. As I said, I'm Patrick. I'm a lecturer in the Department of Decision Sciences here at GW, and I guess the only reason I have any qualification to be on this stage is that in my role at H2O.ai, which is a machine learning software company, we brought maybe the first, or at least one of the first, toolkits for explaining very complex models into the commercial market, and that was about two years ago now. So, for us to have a functioning discussion, we're going to have to start out with some definitions. I'm going to try to tackle this whole issue of statistics versus machine learning versus AI, I'm going to let Nick handle explainable AI and alternative data, and I'm going to let Bryce handle disparate impact and fairness.

When I talk about AI, I like to think of it as very hierarchical. The outside of the hierarchy is the computer science department: AI is a sub-discipline of computer science focused on making computers able to work through processes like humans, or better than humans. Machine learning is a sub-discipline of AI, and it is very specifically focused on learning rules from data and then trying to apply those rules to new data to make decisions. Machine learning technologies are probably the most promising commercial applications of AI right now; there have been other types of AI applied in the past, such as expert systems. And then how does this all relate to statistics? Well, I'm a horrible statistician, but statistics is just a different department. That's how I like to think
about it: it's a different department in the school. Statistics is also focused on learning from data, but it has a very different tradition than machine learning. I'll kind of leave it at that, and you guys should feel free to add on, but I'll pass it to Nick to talk about explainable AI and alternative data.

So the thing with machine learning and AI that I think we all know is that there are what are known as black-box algorithms, which means they're incredibly complicated: it's difficult to know, ultimately, why the data that you put in led to the predictions that came out. That creates some real issues with people being willing to implement and trust AI, because why would you put a lot of money on the line when you really don't know what's going on? So explainable AI essentially is trying to break open the black box and figure out what's going on under the hood: why different variables are making an impact on a particular prediction; globally, whether those variables are impacting the predictions in a positive way, a negative way, or in some way that varies depending on the particular observation; and also, when you get things wrong and when you get things right, why that happened. As a result of understanding those things, you can get more trust in the AI algorithm, and you'll be more likely to implement it. So explainable AI is just about getting an understanding of the algorithm.

On alternative data, one of the things I think is really important to understand is that all of these terms are related and have a lot of overlap, but, as Patrick was pointing out, they are different terms. There's a lot of misinformation, and a lot of things that people don't understand and get confused by, as a result of the terms getting confused. When we're talking about alternative data, what we generally mean, and I think what the CFPB talked about in their RFI
, I believe it was, is data that in banking had traditionally not been provided by the credit bureaus. What that means is it can range from anything like time-series data, just looking at the trends of a person's accounts over time, which I think is fairly basic and easy to understand, and then it branches out into more nuanced data, things that really haven't been available for a long time: rent and utility data, education information, and social media posts, just to name a few examples.

There are a few concerns about alternative data, and I think it's really important to understand what the concerns are, when they're applicable, and when this is a problem with AI versus when this is just a problem with data. One of the big concerns about alternative data is that, because it's relatively new, there are frequently issues of data availability: not all, but many, newer data sources have not been thoroughly vetted or verified; there just doesn't happen to be the history there. So I think that can potentially be a problem, and similarly reliability and accuracy can be a concern. These sorts of things are incredibly problematic in banking, because if you're making a decision based on someone's rent payment history, and they have been paying their bills, they have been paying their rent, but you don't have that information because whatever data vendor you were working with didn't provide it, and you ultimately reject them for that, that's not really a fair situation. So that's something that has to be considered. And the thing that's kind of near and dear to my heart with alternative data and its potential problems is the issue of fairness: are these data really predictive of what you're hoping to predict, or are they predictive of things like race or gender
or age or any other type of protected class status that would ultimately make your decision unfair? We clearly don't want our customers to experience that.

Right, right. Okay, Bryce, I'll give you four minutes to lecture us about disparate impact and fairness, as Nick has kind of teed us up.

Sure. Let's tackle the fairness piece first. I think that in your profession, for folks who are both in academia and practitioners in this space, the conversation about algorithmic fairness has become fairly prominent. About four years ago, when I was at the Bureau working on issues related to FinTech, algorithms, AI, and compliance, I was connected to the FAT/ML group, fairness, accountability, and transparency in machine learning, fatml.org, and I went to an early meeting there. There were some really interesting and great conversations happening around fairness in the context of models, but not a lot of conversations about the fairness standards that might actually apply depending on the context in which models are being developed and deployed. It turns out that in this country we have fairly strong anti-discrimination statutes that pose a lot of compliance risk for models deployed in certain contexts. For example, in the context of credit, the Equal Credit Opportunity Act is sort of the most prominent federal anti-discrimination statute, but you also have Title VII in employment, with its anti-discrimination provisions, and you have the Fair Housing Act, which is Title VIII, also with anti-discrimination provisions, and it matters for the algorithms that are being developed. Usually, when we're thinking about algorithmic fairness in the context of credit, we're most concerned about what's called disparate impact discrimination: that's when a facially neutral policy or practice has a disproportionate impact on a prohibited basis, which I'll define in a moment,
unless the practice meets a legitimate business need that cannot reasonably be achieved as well by less discriminatory alternatives. As for who is protected under ECOA: it's prohibited to discriminate in a credit transaction on the basis of race, color, religion, national origin, sex, marital status, or age (there's a small exception there that we don't need to get into right now), and for a few other reasons. The point is that you can't have an algorithm that discriminates on the basis of race, for example, and you certainly can't do it in ways that are overt: the statute also requires that you not intentionally rely on those characteristics when you're making a credit decision. So what does this mean for models? With very few exceptions, you can't make decisions that explicitly rely on a prohibited basis; you can't include them in your predictive models. You might have attributes in your model that are somewhat predictive of an outcome but highly correlated with a prohibited basis such as race, and that raises concerns around potential disparate impact liability. And finally, and Nick talked about this a little bit, there are ways to mitigate or minimize that risk; the approach that we have been working on is one that's focused on looking for alternatives that are less discriminatory than the one that may have emerged from the model-building process, and we'll talk more about that later.

Okay, great. I don't think we could have asked for more, and I mean that in a good way: very, very helpful. Sorry, we couldn't have asked for better. I appreciate that. All right, so next, now that we've kind of set up some definitions about machine learning, AI, explainable AI, and disparate impact, we want to jump into this idea of what's typically called model governance. This is a set of best practices that's
associated with regulations, and maybe stems from regulations, but I will definitely let Nick and Bryce flesh that out for us. These are basically things that people do in banks, sometimes insurance companies, sometimes HR departments, that help their models be transparent and have less disparate impact. I'm going to turn it over to Bryce to discuss the kind of traditional tension between the people I call the model makers and the model breakers: the internal validators and the external regulators. So I'm going to let Bryce talk about the classic tension there.

Sure. I guess a few things to note. The regulatory framework in financial services is somewhat of a patchwork. It depends on who you are as a lender: whether you're a depository institution, how big you are, whether or not you are subject to prudential oversight. So, the Office of the Comptroller of the Currency: are you under their supervisory authority or are you not? That tends to shape the way that institutions think about compliance. The very large retail and commercial banks tend to have very robust compliance functions; they're heavily regulated, heavily supervised, and subject to a lot of oversight, and they tend to have very robust internal model governance and compliance functions. And the tension that you're talking about, I think, is both real and perhaps intentional. The Federal Reserve Board and the OCC put out guidance on model risk management, affectionately referred to as SR letter, why do I always forget this, 11-7. (Eleven seven, thank you, corrected.) The guidance actually talks about those functions needing to be separate, and there needs to be the authority for someone in a compliance or review position to be able to separately validate and review models, and then actually have the power to
challenge them internally. So that's sort of how things can work on the inside, and we see how these relationships and systems work through the clients that we've worked with. There's also, of course, the oversight that institutions receive from their regulators. I think that where we are right now with compliance, especially thinking about models, is a lot of heterogeneity across lenders, deriving from their experience in the market, the degree of regulatory oversight that they're experiencing, and the extent to which they are really prioritizing compliance.

Okay, so we've heard about this tension between the people at the banks or other companies who are making the models, and then the internal validation teams that try to make the models behave, and then the external regulators who are trying to make them behave. So, Nick, can you give us a little bit about what model governance is trying to accomplish, sort of tactically, and what in the world does that have to do with explainable AI?

One of the things that I really feel is that a good, robust compliance and model governance function can be very healthy for a bank, or for any institution that's using statistical models. There are a couple of primary areas that compliance and model governance should focus on, and they're generally all risk mitigation, which no one likes, because that means constraints on everybody else's behavior. Business risk, regulatory compliance, and reputational risk mitigation are sort of the silos, I think. One of the things about model governance that's important is that they're likely to have a much broader understanding of what the entire business is doing, so that if one area of the business has one focus and another area of the bank has another focus, model governance can see that and make sure
that they're not essentially competing against one another or overextending themselves. One way we saw this with one of our clients: they were a card issuer, and they had deals with multiple different retailers, and one of the marketing offers that was going to go out, associated with one of the retailers, would have been in direct competition with another one of their retailers. Model governance saw this (we were a part of the review), and model governance said, wait, you can't do that kind of thing; it's going to ruin our relationship with both retailers and potentially endanger future relationships as well. So that's one way that model governance can help in a financial and business risk mitigation setting. In a broader financial setting, what we see are things like value-at-risk calculations and capital requirements, all of which overlap between business and regulatory compliance, but ultimately what they do is try to ensure that the bank does not overextend itself. And then when you get into reputational problems, those also overlap with regulations, of course, and you have situations like unfairness and discrimination. The way I put it is that you don't want your CEO testifying before a Senate subcommittee, because that's not good for your career, and you don't want your company on the front page of the New York Times; that is what model governance and compliance can help avoid. It's important to note, and this was along the lines of what Bryce mentioned, that unfairness or discrimination can be entirely unintentional, so your bank or your company can get in trouble even though it has absolutely no intention of discriminating. The compliance function, since they're focused on discrimination, and since they tend to have a wider variety of folks working in those departments than, you
know, just a small silo of modelers, is going to tend to be able to find those issues better, help head them off, and make sure that the company does not get into trouble.

Okay. Any comments you'd like to add? Okay, I can add a little color commentary on this. My experience has been that model governance oftentimes involves writing very long, very tedious reports for the internal validation teams, and so, as someone who tries to make tools for data scientists and data analysts, one thing that we've tried to help them with is reducing the human burden in writing these very long reports. But I think that even though it's annoying, it's very, very essential to have this kind of human review of our AI systems, because if you don't know what your AI is doing, I mean, who does, right? So I think this basic idea of human review is one of those things that we would think can be generalized from this kind of highly regulated financial services space into, you know, selling ads or sending coupons or facial recognition, whatever it is that people are trying to do with AI. Okay, with that, let's transition to talking a little bit more specifically about some of the regulations that sort of drive model governance. As Bryce mentioned, we're lucky to have him; he just came from the CFPB, and he was really hands-on in a lot of this stuff. So what was the CFPB's role in terms of regulating this kind of stuff?

Sure. So, again, I'm an economist, not an attorney, and some of my attorney friends in the audience are more knowledgeable about the law than I am, so if I get it wrong, I guess chide me later. But the Bureau has been around for almost eight years; I think it was July 21st, 2011 when it went live. It was created to protect consumers in credit markets and has a variety of supervisory, enforcement, and rulemaking authorities related to laws and legal authorities
relevant for lending in this space. It also has other statutory purposes related to education, advocacy, and other things, but I'm most familiar with the supervision, enforcement, and rulemaking. Three of the most important laws relevant to this conversation about models and credit are, first, the Equal Credit Opportunity Act, ECOA, which we mentioned earlier. It has two major provisions: one is the anti-discrimination provision, and the other is an adverse action notice provision that we're going to talk about later, but both are designed to protect consumers in credit markets against discrimination. There's also the Fair Credit Reporting Act, which is a disclosure statute; it requires notification of why you're being denied if you're denied access to credit, and it lets you verify certain features that may have been relied on in a credit decisioning process and correct anything that might be in error. And then the final one is an authority called unfair, deceptive, or abusive acts or practices, UDAAP. It's a legal authority that's a little bit broader than what the FTC has; they have a similar UDAP statute, kind of missing one of the parts, but it also protects consumers against practices that might be unsavory and otherwise lead to harm.

Do you mind if I jump in? One of the things that I think is really important about some of the things Bryce was talking about, in terms of these particular regulations, is that given people's current trust in AI, or should I say distrust in AI, there's a potential for further regulations coming down the road: certainly not right now, but maybe in 2020. So I think it's important to understand these regulations, because anything that does come down the road, whether for banking or for industries in general, is probably going to be similar to what we see right here and what is currently done
in banking. So I think it's important to have this understanding and maybe begin to think about how these types of regulations would affect your industry.

While ECOA, and Regulation B, its implementing regulation, could really use some refreshing, let's hope that the key provisions of that statute don't change radically. But I think what we do need to think about is, with emerging technologies and algorithms and models and alternative data, how do we demonstrate that they are in compliance with the law? That's sort of the real challenge. And I do think you're right with respect to whether some future regulation comes along that we're blindsided by: there's the recent Algorithmic Accountability Act proposed by Senator Booker and colleagues, which puts a stake in the ground with respect to what companies who are using algorithms should be doing to document and evaluate them. So I totally agree with you. I think that being on top of, being ahead of, the conversation and the developments here is kind of the right place to be, and we don't want anybody from any social media companies to run out the door now because they think this isn't going to apply to them, because my guess is that at some point it will, and it probably should, is my opinion. So I agree.

Okay, so now that we've introduced these regulations a little bit, I'll see if I can parse this. The way I understand it, there are two basic goals with the regulations. One is to make the decisions that the models make explainable, and the key reasons they need to be explainable (and I'll let these guys have their say here) are, one, documentation and human oversight of what's going on, but maybe more importantly, human appeal of model decisions. If I don't know how the
model works, it's very, very hard to appeal the decision, and this is real. There was a New York Times article in 2016 about someone being denied parole because of a faulty black-box algorithm, and that has nothing to do with credit lending; that's actually a nearly unregulated space, which is mind-boggling. But this is real, and it's happening today. So the regulations say the models need to be transparent, so that their inner workings can be understood. They also say that the models need to have minimal disparate impact. Disparate treatment is when we're kind of intentionally racist (and I'll let you guys correct me); disparate impact is kind of like when we're using features in our model, like someone's propensity to play tennis, that are correlated with some kind of protected attribute, like someone's race or gender. So, who's heard of deep learning? I was just at ICLR, like the new important big deep learning conference, and the first hour and a half of that conference, with all the attendees in the same room, was exclusively about fairness, not about deep learning at all. So fairness (and I struggle to even use that term because it's so hard to define) is really on the minds of leading machine learning researchers right now. Both of these things, explainability and minimizing disparate impact, are what the regulations say, and from my perspective as a tool maker, there's an arms race to do explainable AI so that people can use machine learning for credit scoring, while in the academic setting it seems like there's a lot of focus on fairness right now. So these are two really hot topics, and I'm going to let Nick discuss transparency and explanation and get into the details a little bit there, and then we'll come back to disparate impact and fairness.

So, before I jump into
transparency, one thing I wanted to mention: Patrick was talking about this conference where fairness was really a central topic, and in the work that I do I've seen that this is true. I've gone to a number of academic conferences where fairness has been the primary or sole topic, and one of the frustrating things I've experienced at those conferences is that there's a significant amount of argument over what the definition of fairness is. I agree that's very important to explore, but the thing I'm worried about is that we spend so much time worrying about the definition of fairness that we don't actually move forward in making our models fairer. There are, what, probably 50 or 60 years of legal work where fairness has been evaluated in employment, healthcare, banking, and housing, and the rules are already pretty set. They very well may not be perfect, and they should be reviewed, but there are at least guidelines, and what I think would be great for the academic community as well as industry is to start thinking about how you can make models fair within those contexts, because if we don't really work towards that, then we're not really helping people. And I think Patrick's example of the parole case is very important. Sometimes I feel like when I'm building a model I don't realize the effect that a model might have on someone: denying a person a mortgage can be catastrophic. Getting a mortgage can move a person up into the middle class; giving a mortgage to the wrong person can cause incredible amounts of harm as well. So our models really have a very strong impact on people's lives, and if we don't keep that in mind and take these ideas of fairness into consideration, we're really doing our customers a disservice.

And I'll throw in, I'll stand up and preach: the basic test for disparate impact is extremely simple.
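To make that concrete, here is a minimal sketch of the kind of check being described: an adverse impact ratio computed on made-up approval data. The numbers, group labels, and the four-fifths threshold shown here are illustrative only, not taken from any actual lender or compliance program.

```python
# Hypothetical approval decisions: 1 = approved, 0 = denied.
protected_group = [1, 0, 1, 1, 0, 1, 0, 1]   # applicants in a protected class
control_group   = [1, 1, 1, 0, 1, 1, 1, 1]   # comparison group

# Adverse impact ratio: protected-group approval rate / control-group approval rate.
protected_rate = sum(protected_group) / len(protected_group)
control_rate = sum(control_group) / len(control_group)
air = protected_rate / control_rate

# A common rule of thumb (the "four-fifths rule") flags ratios below 0.8 for review.
print(f"AIR = {air:.3f}, flagged for review: {air < 0.8}")
```

A check like this only looks at average outcomes by group, which is exactly the demographic parity notion discussed here; other fairness definitions would need different tests.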
I'm pretty sure everyone in this room could do it, possibly with paper and pencil, so if you're making models that affect human beings, I really feel like it's something you should do.

And just a footnote, since I didn't mention this earlier: we're usually thinking about demographic parity, so looking at a model's outcomes and asking whether or not, on average, those outcomes differ on the basis of a protected class, a prohibited basis. It's not the only notion of fairness; we can think about other aspects of models and think about fairness there too, but that's the typical way of thinking about things, sort of setting aside the multiplicity of definitions of fairness that might apply to these sorts of models that classify individuals into, you know, good and bad risks.

And to Patrick's point, I mean, the adverse impact ratio is three lines of code. I'm a bad coder, and in getting it working in just about every possible situation I've managed to make it 150 lines of code, but still, those three lines are good enough.

Should I go on to transparency? Yeah, tell us how we make the models transparent, or whatever you wanted to say about that. Well, the first thing to do is to use H2O software. I am getting paid for this, right? No, you know, there's lots of good software, and I will show you many types of software if you're interested, but I do really appreciate H2O software. What is really great about these software packages in general is how far they've come. I've been in the AI and machine learning space for probably four or five years now, and when I first got into it, it was kind of a hopeless situation. Then LIME came out, which is one method for analyzing what's driving a particular prediction, and it's good, but it's not great. And then Shapley values came out, what, in 2017, and those have really become all the rage. What's great about those is that they can give you both global and local
in other words, individual-level, understanding of what variable is causing a particular decision. That gets into the adverse action notices, which I believe Bryce talked about: if you as a bank make a decision that has an adverse effect on a person, in other words you give them a high interest rate or you reject them for a loan, you have to issue a notice that says we rejected you for X, Y, and Z reason. Well, in the traditional logistic regression space that was relatively easy, at least, but in machine learning, where we don't understand what's driving a prediction, that becomes very difficult. The great thing about Shapley values is that they can tell you both the direction and the magnitude of a particular variable's effect. Another nice thing is that they're additive: if you have, say, ten different variables that measure how long you've had credit, the time you've had a FICO score, the time you've had a mortgage, the time you've had X, Y, or Z, you can take the Shapley values from each of those, sum them together, and get a single value that tells you the effect of time with credit, and then you can give that as an explanation to a customer, and they understand why they got rejected.

Can I just add one thing? In addition to the legal requirement that both ECOA and FCRA impose with respect to adverse action notification, I think the explainability piece is also really important when we're thinking about the fairness question: we want to know what in the model is important and what might be important drivers of disparate outcomes that we're observing. So I just wanted to throw that in there.

Yeah, I totally agree, and I'm going to try to step back and make this as simple as I possibly can, just to make sure people are following along.
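The additivity idea can be sketched roughly like this. Suppose you already have per-feature Shapley values for one rejected applicant; the feature names and numbers below are invented for illustration, not output from any real credit model, and a real pipeline would get them from an explainer rather than hard-code them.

```python
# Hypothetical Shapley values for one rejected applicant, in log-odds units;
# negative values push the prediction toward rejection.
shap_values = {
    "months_since_oldest_tradeline": -0.42,
    "months_since_first_mortgage":   -0.17,
    "months_with_fico_score":        -0.08,
    "debt_to_income_ratio":          -0.55,
    "recent_inquiries":              -0.31,
    "utilization_rate":              0.12,
}

# Because Shapley values are additive, related features can be summed
# into a single, customer-facing reason.
credit_age_features = [
    "months_since_oldest_tradeline",
    "months_since_first_mortgage",
    "months_with_fico_score",
]
reasons = {
    "length of credit history": sum(shap_values[f] for f in credit_age_features),
    "debt-to-income ratio": shap_values["debt_to_income_ratio"],
    "recent credit inquiries": shap_values["recent_inquiries"],
    "credit utilization": shap_values["utilization_rate"],
}

# Rank reasons by aggregate contribution: the most negative come first,
# so the top entries are candidate adverse-action reasons.
top = sorted(reasons.items(), key=lambda kv: kv[1])
for name, contribution in top:
    print(f"{name}: {contribution:+.2f}")
```

In this made-up example, the three credit-age features collapse into one "length of credit history" reason that outweighs any single raw feature, which is the point of the aggregation.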
been denied a credit card, let's say, which I have been; I remember being denied for a credit card when I was young. They have to tell you why. They have to say, well, your length of credit isn't long enough, your credit score isn't high enough, X, Y, Z. And how many do they have to tell you, five?

Well, the guidance is up to four. But it's not statutory, it's in the staff commentary. The poetry of regulation.

So if you're using a fancy deep learning neural network for this, it's taking all the variables that go into your model and combining them and recombining them, literally millions or even billions of times, and it's really, really hard to say what the one reason was that you didn't get the credit card. So these Shapley values that Nick brought up have been instrumental. They're a very precise mathematical way to take the machine learning functional form that recombines the variables over and over again and say, for this person, the decision was driven by these five variables. That's been really critical for explainable AI and the adverse action notice progress in that field with machine learning. Do you want to add anything?

Yeah. I was at a conference a little while ago where modelers from fintech were talking about Shapley values, and they said that the adverse action notice problem had been solved in AI. I think that may be a little bit of an exaggeration, but great strides have been made. Pretty close, yes. But I think Bryce's point about fairness, and I think that's where we were about to go, Patrick, is really important. Explainable AI can really help you start to get an understanding of fairness and whether or not your model is discriminating. And it's important to realize that fundamentally every statistical model is inequitable. You're looking
at averages or medians or things like that, depending on the model, typically averages conditional on some factors, which means that some people you're giving a favorable decision to whom you shouldn't, and other people you're giving an unfavorable decision to whom you shouldn't. The question, though, is what is driving that inequity, is there any way to decrease it, and is it related to things like race, gender, age, and all the other characteristics that Bryce mentioned? And the answer to whether it's related to those is very frequently yes. In machine learning we have to be especially careful about that as we add more variables and more different types of data, because that can really be an issue. I thought I'd just give a couple of examples.

Yeah, regale us with your war story.

So one of the interesting things that came out was a paper that used a German credit dataset, and one of the things they looked at was whether you use an Android or an iPhone, and how much that predicts your credit risk. What was really interesting was that it was actually able to distinguish between people at the 50th and 80th percentiles of credit risk according to their traditional credit score. That's a pretty astounding result for something like just what phone you use. And I don't know if this is a problem in Germany, but in the U.S.
there's a difference: African Americans tend to use Androids much more frequently than non-Hispanic whites do. The reason for that is probably twofold. One is a wealth effect, and the other is a social network effect: if my friends use iPhones, I'm more likely to use iPhones; if they use Androids, I'm more likely to use Androids. And Androids also tend to be cheaper, so people with lower levels of wealth will tend to use them. Those are two entirely different things. Wealth may be a reasonable measure to use in a credit score, but a social network effect is probably not, especially in a situation where African Americans tend to use these phones, which drives more African Americans to use these phones, because our country is unfortunately so segregated. So using this as a variable in credit decisioning could have real problems despite its potential predictivity.

There are a few other things that can be real problems. This one is a problem with traditional models too, and may actually be a bigger problem with traditional models than with AI: under-representativeness of data. We've seen this in the imaging examples, facial recognition, where there were very few African Americans, Hispanics, and Asians in the datasets used to train facial recognition models, and so those groups were identified correctly at a much lower rate. There were just far more errors for minorities, and this was being sold to law enforcement agencies.

Yeah, and there goes another example: these decisions, these things that you're doing, have big impacts on people's lives, so you really have to be careful about it.

So part of the problem, I think, is that there simply were not enough minorities in the datasets being used to train these models, and that resulted in discrimination. The other problem that can really be an issue,
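(Going back to the phone example for a moment: a cheap first-pass proxy check, before a candidate variable goes anywhere near a credit model, is simply to compare its rates across groups. A sketch with entirely made-up data:)

```python
import numpy as np

# Hypothetical applicant data: 1 = uses an Android, plus a protected-class flag
uses_android = np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0])
protected = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0], dtype=bool)

rate_protected = uses_android[protected].mean()
rate_control = uses_android[~protected].mean()

# A large gap flags the variable as a potential proxy for the protected
# class, worth scrutinizing (or testing formally) before it is used
print(rate_protected, rate_control)  # → 0.6666666666666667 0.3333333333333333
```

(In practice you would test this on real data with a proper association measure, and as the panelists note, a correlated variable is not automatically off-limits; the question is whether it reflects a legitimate driver like wealth or a social-network artifact.)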
and again it can be an issue with both types of data, but I think in the machine learning and AI space it's becoming a bigger one, is perpetuating bias. If you are using a dataset that is based on, or incorporates, human decisions, and those decisions themselves have discrimination in them, then you're likely to perpetuate that discrimination as you move forward. There's one important caveat to that, and I think this is really important because it's a big selling point of AI: as we move to automated AI, those biased decisions, the discrimination that was caused by the human decisions, as the system becomes automated and goes through round after round after round, because the machine is just looking at the predictive quality of the model, it's eventually going to see that that discrimination is not predictive, and it's hopefully going to stop incorporating it. So ultimately that's a potential value of AI that I think is often overlooked.

My boss always says that software should be easier to fix than people. Do you want to chime in on anything?

Yeah, I just want to quickly mention, for my legal and policy friends in the crowd, that I think the adverse action notice provision of ECOA and Reg B really needs refreshing. The regulation includes a model form of 24 reasons that was developed, I don't know, in the '90s or whenever, but it really refers to traditional measures of credit. So I think that as we move into a world where different types of information, the so-called alternative data that Nick was talking about, become more readily used in these contexts, the existing model form, which is just there to serve as an example, becomes more difficult to use, because you're shoehorning things into reasons that are tied to traditional creditworthiness. So if there's a lender that's using information about where you went to school or what you majored in,
figuring out how to provide that information to a consumer becomes a little thornier. And, you know, there's no reason that lenders need to use that form, but they do, because it creates a safe harbor. There's some confusion around what lenders are supposed to be doing in these contexts. Am I supposed to tell you why you were denied access to credit, and only tell you the reasons that you can change going forward, like past payment behavior or something like that? But what if my model says that where you went to school and the major you chose was a primary driver of why we denied you access to credit? There are those out there who believe that since you can't really change that, why would you tell a consumer something they can't do anything about? So there's a lot of confusion around what it actually means to comply with this transparency or explainability provision in the law.

Okay, no, I think that makes a lot of sense. All right, I'm going to transition very quickly to talk about how feasible this is today, and I'll get the discussion started; if Nick and Bryce want to jump in, fine, and if not, we can skip on to the next topic. Nick kind of hinted at this: my company, among several other commercial software vendors, already makes software for this. There are numerous open-source packages for explaining models, disparate impact analysis, debugging models, and interpretable nonlinear complex models. Now, I think Nick is right when he says nobody has this fully solved yet, or very, very few people do. I'm actually aware of two financial services companies that are pulling this off as we speak, and there may be more, but the knowledge certainly hasn't trickled down to the rest of us, so to speak. But what I really want to emphasize is that there is commercial software available to do all the things
that we've talked about, and there is free, open-source software available to do all the things that we've talked about today. So my personal take on this is that it won't be the incremental Easter egg hunt that is deep learning, it won't be a slowdown in deep learning progress, that causes the next AI winter. Are you familiar with this term, AI winter? In the past, AI has become so overhyped and fallen so short of its actual results that the major government funding agencies basically killed off all funding for AI research. That's happened at least twice; some people say up to four times. So if there's a next AI winter, I really feel that it will be brought on by lazy practices: algorithmic discrimination; lack of transparency, so that there can't be any human review or human appeal; and just errors. Nick brought up this Amazon facial recognition product that's gotten a lot of attention recently. It was better than human accuracy on white males and had something like a 60% accuracy rate on black females. It was just wrong. So I'm not so worried about failures of deep learning causing the next AI winter; I'm worried about sloppy practices that cause discrimination, security breaches, and lack of transparency. That's my soapbox. We can do this today if we work hard, there's no reason not to, and if we don't, then we're going to get into a position where the government might really step in and just shut machine learning down. I mean, that's draconian, but they could really put draconian regulations in place if something really bad happens, because this is a very complex subject and very hard to understand, and I don't think anybody in this room wants that. So I'll let you guys correct all the dumb things I said, starting with Bryce.

Oh, nothing to correct. But I want to
sort of echo this concern about government regulation and provide a little bit of insight. In the policy space there's been a lot of talk about promoting innovation in consumer credit: the promise that new types of data and new methodologies will help reach deeper into the pool of folks who are so-called credit invisibles, people who aren't part of the traditional credit system, and the hope that these methods and data will bring them in. But at the same time, I think there's also a lot of concern about how to protect consumers in that world, and so a lot of the conversations are considering both those benefits and risks. Over the past five to ten years, a number of government agencies and regulators have issued reports. The Bureau put out a request for information with respect to algorithmic fairness and machine learning. And, as we mentioned earlier, Senators Booker and Wyden and Representative Clarke introduced the Algorithmic Accountability Act, which might not have legs, but nonetheless it is putting a stake in the ground with respect to how we might start thinking about what regulation would look like in this space. So the conversation is happening now, and I think you're right: if one big bad thing happens, we may be thrust into a new regulatory world that we're not thinking about today, and it's going to be really hard to land on our feet if something like that happens.

Yeah. And to go back to Patrick's point about the AI winter, my understanding is that the last one was caused just by the performance of the systems. The old neural networks generally performed better than traditional statistical methods, but not that much better, and they were so complex and so unintuitive that people just didn't want to use them. And I think Patrick's
right that this next AI winter, if it happens, which I hope it doesn't, is going to be caused by human failure rather than model failure. Because one thing that I think we can all be certain of is that deep learning, and AI and machine learning in general, have made huge strides. It's no longer a matter of these systems not doing a better job; they're doing incredible jobs, things that really could not have been possible five or ten years ago. But they may be making decisions where ultimately people feel it's not worth the risk. So I think we really do need to engage with these questions: are our models discriminating, are they being unfair? And, you know, the title of this section is feasibility, and one of the points that I really want to make is that doing this, measuring discrimination, fixing discrimination, looking at errors and whether or not they're related to protected classes and things like that, is all possible now. So it really is incumbent upon all of us to be doing these things, because it's the right way to treat your customers, and it's the only way that AI is going to continue to be accepted.

Yeah, and I always call this a win-win-win. One, it's just the right thing to do, I think. Two, it's a good way to minimize risk, as we've talked about a lot; that's a little bit different from making money, but minimizing risk is often a good thing. So the two wins aside from it being the right thing to do are, one, operational and model risk, and the other is reputational risk, like Nick brought up earlier. Who wants to wake up and find that their company, their boss, their group, their model is in the newspaper, being dragged on Twitter for being wrong in some very unfortunate way? Okay, just for the sake of time, I'm going to switch us over to federal government interest. I think I kind of covered it. All right, nothing?

Yeah, I threw that in when you sort of made the observation
about regulations. I think I've covered what I would want to say there. But I did want to echo Nick's observation that these are actually solvable problems, and emphasize that I think we solve them by having the practitioners, the folks who are developing these models and thinking about them in academic contexts and elsewhere, actually have conversations with the folks who are engaged in the policy-making effort. Because if that doesn't happen, then I think we end up with a maybe even worse outcome than in a world in which it does.

Okay. And just in case: I think the take-home of those last two segments was that it's possible to do this today, corporations are interested in doing it today, and the federal government is interested in doing it today. So it's just going to be a real shame if it doesn't happen, and I think probably most people are here because they want it to happen. So, very briefly, we'll talk about what we see as some of the future opportunities, and I'll start it off and let these gentlemen chime in. We've talked a lot about the dangers of machine learning, and I'd like to spend just a moment on a technical aspect of machine learning that might make it even better. There's this idea in machine learning called the multiplicity of good models: for any one dataset there's a very, very large number of highly accurate models. This is pretty different from the linear model world, the more traditional statistical world, where for any given dataset there are really only a handful of good models. So I think it's very possible that with machine learning, this idea of the multiplicity of good models just opens up a lot of opportunities for minimizing disparate impact while having maximal accuracy, for having
maximal accuracy and also being transparent. There are just a lot more options, and I think that's really exciting. And then I'll highlight what my boss always says: I really do think that software is easier to fix than people and massive social problems, and of course they all get tangled together, but I'll leave it at that. Software should be easier to fix than people. Go for it, future opportunities.

Okay, sure. Aside from just thanking everyone for attending (is this one working? yes), I kind of want to make one point. Back to the government interest: there have been a lot of reports and information emanating from the government about the risks and the opportunities and so on and so forth, but not a lot in terms of guidance in this space, which can be difficult to do, and not a lot even in terms of solutions. So I think there's a real opportunity here. This may be a common refrain, but waiting for the government to come in and say "this is what you should be doing" is probably not a good strategy, because it's not clear that the government knows what should be done. There are real opportunities for folks who are familiar with policy and law, and who are doing work in this space, to actually think about what those solutions should be, and I think that, as a very near-term next step forward, is a very important one to take.

Cool. I hope I didn't steal your closing comments; I feel like I got one of my points from you, actually. I should thank you. Yes, the multiplicity of good models, so I should thank Nick for that.

Good, I'm glad you got it from me, because I was thinking that was my point. Sorry, sorry. But just as an example on that, because I think it's a really important one: one of our clients was implementing a random forest
model, and there was pretty severe disparate impact in it. So what I did was run 40,000 different random forest models with different features and hyperparameters, based on different subsets of the observations, and what I found was that a huge number of the models were similarly predictive, the same quality, but the amount of disparate impact changed quite substantially. What that tells me is that there really is an opportunity in machine learning: it is possible to come up with fairer models that are not much less predictive. So go forth and try that. The other thing is that AI can be a force for good. I really believe that AI can be fairer than traditional statistics, and it can help people, and there's a responsibility that we all have when we implement AI. And the other point I want to make is just that explainable AI has come a long way. Like I said, I don't know that it's been solved, but I think it's been solved to a sufficient degree, 90 percent or something, solved enough that if you have a smart group of diverse people reviewing the model, you can have a good, reliable outcome.

Okay, we've been talking at you guys for a long time, so I'm going to turn it over to audience questions, and I'll just highlight something that took me a long time to understand that Bryce brought up. When we say that AI will unlock more lending, what that means is that the AI model should be more accurate, should have the capability to be more accurate, so that credit lenders feel more comfortable lending to riskier people, because they can make a more accurate decision about whether that risky person is going to pay back their loan or not. So another thing that people talk about in terms of the good of AI is being able to lend to more people. I think that's another positive that people often mention, and it took me a long time to figure out what people meant when they said it. Okay,
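(Nick's 40,000-model search can be boiled down to a small sketch: fit a few models of similar quality and compare accuracy against the adverse impact ratio. Everything below, the data, the wealth-and-proxy setup, and the simple linear-probability scorer, is synthetic and hypothetical; it only illustrates the multiplicity point that near-equally-accurate models can carry very different disparate impact:)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Synthetic applicants: two legitimate features, plus a "wealth" driver
# that is lower on average for the protected group and leaks into a
# proxy variable (think phone type)
protected = rng.random(n) < 0.4
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
wealth = rng.normal(size=n) - 0.8 * protected
proxy = wealth + rng.normal(scale=0.5, size=n)
y = (x1 + 0.5 * x2 + wealth + rng.normal(scale=0.8, size=n) > 0).astype(float)

features = {"no_proxy": [x1, x2], "with_proxy": [x1, x2, proxy]}

def fit_and_score(cols):
    """Linear-probability scorer; returns (accuracy, adverse impact ratio)."""
    A = np.column_stack([np.ones(n)] + cols)
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    approve = A @ w > 0.5
    acc = (approve == (y == 1)).mean()
    air = approve[protected].mean() / approve[~protected].mean()
    return acc, air

for name, cols in features.items():
    acc, air = fit_and_score(cols)
    print(f"{name}: accuracy={acc:.3f}  AIR={air:.2f}")
```

(On data like this, the two models score similarly on accuracy while the proxy-free one has an adverse impact ratio much closer to 1; a real search would sweep features, hyperparameters, and resamples the way Nick describes, keeping the fairest model within some accuracy tolerance.)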
we've talked at you for an hour, hope it was interesting. Let's do some questions. Okay, let's go back there, and then we'll come here. No, no, sorry, okay, I'll get to you. Oh, model cards, yeah.

Yes, as a requirement for any model that's made available, just telling you things about it. Maybe it doesn't eliminate the bias in the model, but: we told you this model has some bias in this area, or this model has been trained on this type of data, maybe it's good for picking songs but maybe don't use it for something else. I wanted to see what you thought about, first of all, the feasibility of maybe encouraging, if not enforcing, that kind of labeling.

Okay, and let me repeat it for the camera. The question was: there was this paper from Google this year about model cards, which is back to our point about model documentation, a sort of formalized summarization of important things about models, and how feasible is it? Not necessarily as a technology, because as a technology we have it finished, but how feasible is it to actually have people do this on a wide scale?

So this is actually one of the places where the practices of banking can be extended to other industries. I think banking already has a lot of those model-card-type things built into it whenever a model is approved, and I fully support that; I think it's a very good idea. I don't know how much of it should necessarily be disclosed to the public; that may be a policy issue that needs to be considered. One of my worries is that models will have disparate impact, there may be some biases in them, and if you release that information there may be confusion about it, so I think it's important to make sure that the people who are reviewing those documents understand them and understand what the various metrics mean. Probably having them available to regulators would be a very good idea. And in terms of
any comments on, like, do you think we can make marketing departments do this? Do you think we can make marketing departments do model cards?

Yes. Yes, okay. I just wanted to add: thinking about disclosures to consumers to influence behavior is a slightly different animal than disclosure to some other, maybe more sophisticated, reviewer. And so this proposed Algorithmic Accountability Act, you should read it, because it kind of falls along these lines. It would require, as written (and again, this is only a few weeks old), folks deploying algorithms who meet certain thresholds to actually document and evaluate their models for certain types of biases and all these sorts of things. It doesn't compel the model deployer to share that information with the public; it's a voluntary provision. But, to echo Nick's point, what it does do is this: for someone who's coming in to supervise that institution, whether it's a bank regulator or someone else, there's very clear documentation of how that model functions, and so it's very easy then to start asking questions. Well, you've documented a bias in your model, and this is a credit context, so there may be an ECOA discrimination problem here; or, in whatever context, there may be some legal fairness or non-discrimination provision that would cause any identified biases to be questioned with respect to the law. It also probably creates incentives for folks who are deploying models, if they're being forced to learn something about the model, to actually correct it before they put it into operation. So I think a lot of it is just, you know, sometimes not knowing is a safer way to manage some sorts of risks, and so if you could force that sort of self-
education or documentation, it can probably get you a good part of the way forward.

And I'll just give a quick plug: at H2O we think this is a really good idea, and in our commercial software every model we build comes with a report. It's a good idea. Let me just repeat the next question for the camera: can we talk a little about combining, mathematically, in an objective function, objectives for accuracy and also objectives for minimal disparate impact?

So that's what we do, or at least some of the techniques that we use. Our methodologies take these ideas and put them in the context of the law, our understanding of how the law functions, and the requirements that a bank has to follow in order to make sure that they're not discriminating against their customers. There's been a significant amount of work in the academic community on doing just that. One of the most promising papers, I think, is one on adversarial learning, where you have one network that is trying to predict an outcome, and you have another network that's trying to use that prediction to predict race or gender or whatever it might be, and you combine those into a single objective function where essentially the models fight: they're trying to maximize the predictive quality while making the race predictions as bad as possible. Those apparently have worked pretty well, and if you're interested in them, IBM has AI Fairness 360, AIF360. Most of those tools, I wouldn't put them into production at a bank, but they're certainly fun to play with.

Yeah, and I'll just add really quickly that, as part of the framework that Nick was talking about, keep in mind that there's not only the disparate impact risk; there's disparate treatment liability under ECOA, and so you have to actually be really careful about how you use
information about race in the model development process. Usually, in the context of credit, the models are developed in a race-neutral environment, where no one has access to reported or imputed race, and then, once the final baseline model has been developed and validated, there's a series of testing: can we find a less discriminatory alternative that is roughly as predictive as the one that emerged from this race-neutral process? So you have to be really careful about where you start bringing race into the equation, so to speak.

Yeah, there's a lot of subtlety there. One point that I picked up on from the panelists: in banking, and places where ECOA and FCRA apply, you have to do things a certain way, and you may not even be allowed to have race, or gender, or age, in your model. But outside of that, which is a whole wide world of data science, there are lots of new and interesting things that you can and should try that are very much along the lines of what you brought up. I would suggest hiring lawyers first, and I happen to know a few in the crowd I would love you to work with, but really understanding the legal implications of what you're doing is essential, because it's very easy to make mistakes.

I think that's generally true. Well, but let's talk about our friends out in Silicon Valley, who have massive social responsibility, right? Maybe that's the one place where your theory breaks down. Maybe not.

That's right. Yeah, I just wanted to quickly mention, we were sharing this the other day via email: I don't know if you caught All Things Considered on Monday, but Microsoft president Brad Smith spoke to Audie Cornish about this very issue. He's a very strong proponent of regulation in this space, and it's a five-minute listen on your way home or whatever; it's actually
very interesting, and he tees up this issue in a really nice way.

Yeah, I completely agree. And this is one of those places where regulation, compliance, and attorneys are actually making statisticians better modelers. Because what I've seen in the last few years of AI is that it's been sort of a wild, wild west: throw as much data into as complicated an algorithm as you can find, and even if you don't understand it, if the predictions look good, we're done. And the compliance folks and lawyers and regulators are saying, wait a second, this doesn't make any sense. So companies like H2O, and a lot of academics, are saying, okay, in order to get AI adopted we're going to have to take a step back and try to understand it. And what that means is that as we try to understand it, we see that throwing every piece of data into the most complicated statistical model we can find actually doesn't make sense. So I think, and Patrick, I know, disagrees with me, but at least in what I've been seeing, we're moving into a situation where simplicity and parsimonious models are actually getting a little bit more traction. People are realizing that adding the 567th variable doesn't actually add value; it only complicates the situation. That's not universally true, and it shouldn't always be the case, but I've seen a movement toward that.

Luckily, I don't remember the context in which I disagreed, so that sounds perfectly reasonable to me. I wonder what I was thinking about. Okay, a question there, or a comment?

It does seem, well, I have an Android, so I get a rejection notice now. In just trying to cope with the world, I might imitate my friends: don't buy Androids. You're mining what are actually complications of the basic credit assessment situation. Creditworthiness is what you're trying to predict, in general, and you find people who wear blue clothing
are predictably worse risks, or whatever it might be. So I think this is a discussion about gameability: can you game the model? And I think the transparency part of this might actually solve that. If a lender knows that they're being forced to tell you that you're being denied credit because your shirt is blue, then it sort of erodes the entire pursuit of developing a model that people can easily game, the typing-in-all-caps kind of thing. The other concern, maybe, is that because these things have 375 variables or whatever, they could be too complicated for the rest of us, as regulators or sophisticated people, to see into. I don't think I've ever heard someone argue this as elegantly as you're arguing it. I'll add that there's a whole other element to what you're saying that very few people think about: what if I'm trying to trick the model? Which is very scary. As a relatively skilled data scientist, there's probably a sloppy small lending company that I could go to and do what's called a data poisoning attack, or an adversarial example attack, where I could very subtly change the data so that it makes big loans to people like my girlfriend's mother, and it's debatable whether that would even be illegal. So I fully agree with what you're saying, and I hope Nick or Bryce has something smarter to say here, but what immediately jumped into my mind was two things. That's why I disagreed with you: maybe with the clientele that we work with, they get it, like, oh, we should be using these Cynthia Rudin-type models, but the broader public, it does seem like there's this madness of, you know, don't lend to people because they type in all caps, or using facial
recognition for this and that. So maybe you were thinking about our clientele and I was thinking more about the general mood; that's where my mind went. And then also the security aspect: you may be an innocent bystander, but — I had someone tell me that the first hacks on the internet were in the 1960s, okay? The second a technology is made available, people will try to abuse it, and that's very scary when we start thinking about lending and employment and getting into colleges and police using facial recognition and stuff like that. So it's even scarier than what you're thinking. Yeah, I think it's a real danger, and I think it's happening — it's happening here. Well, I mean, yes, it may be happening on a larger scale elsewhere, and my caution there is, you know, I don't like to apply my own East Coast liberal latte-drinking ideas of fairness to other people, and there's also cultural relativism and stuff like that; it's a very complicated question. I'm not disagreeing with you. I would say — I want to emphasize — that it is happening here: there are notable examples of people being held in prison by black boxes, because the black box was wrong and they couldn't appeal it because it was a black box, and there are other examples too. But you're probably right that it's happening on a much larger scale in other places, government-driven, and that's a real problem. And it's sort of two problems: one of them is bad data, and the other one is bad results — models make errors, and if you don't have some sort of appeal process, you're in trouble. And I think that we really need to grapple with these issues. As far as the absurd-data example — you know, blue shirts leading you to not get credit — what I would like to see is some of the regulations updated so that the banks and other lenders have to be a bit more explicit, so that you can see, hey, it's
because — I got rejected because I was wearing a blue shirt. Right now, well, let me say: the larger lenders, the commercial banks that we all know, most of them are not using AI models for credit, and most of them are not using alternative data, or they're using it in very limited situations, what's called a second look. I'm aware of two big-time financial services companies in the world using AI in credit lending decisions, and there are probably thousands and thousands of these models deployed, and I'm aware of two — so it's a very, very low percentage. But it's growing, yeah, there's no doubt. So I think ultimately, we as users, supposing we don't know anything about AI, ultimately we have to trust the regulators to look at these models and also trust the companies to do the right thing. But I think the crux of your question, which I think is a very good one that no one has given an answer to — and maybe other people in the audience would have an answer — is what can you as a private citizen do? [Partially inaudible.] And even if you have very good models, you can have definitions of fairness that are fundamentally contradictory. There's this famous example of COMPAS — that's the one I keep talking about; it's the thing to talk about in fairness — and what it was, was a program to predict recidivism. What happened was ProPublica came out and did a study of it, and they found that African-Americans, and I believe Hispanics, were much more likely as a group to be flagged as likely recidivists even though they were not. So there were a lot of false positives — a much higher false positive rate for minorities relative to whites — and I think that hopefully all of us can agree that that's terrible, and it seems like an unfair thing. COMPAS's makers came back, and they said,
wait a second, we are treating people fairly. And what they did was look at individuals with particular scores, and then look at what percentage of those people ended up being recidivists, and what they found was that among the people who had a 20 percent chance of being recidivists, 20 percent of the people actually were recidivists. So on one measure — essentially just treating similar people similarly — it was fair; on another measure, this group fairness measure, it was not fair. And it turned out that those are algebraically, fundamentally at odds: any time you have a situation where the underlying rates of occurrence are different — and what is going on here is that African-Americans have higher rates of recidivism, and we can talk about all the societal reasons for that, but just focusing on the COMPAS algorithm as a potential driver of discrimination — no matter what it did, it was going to violate one of those measures of fairness. And that's a real problem in terms of defining what we want to be as a society: how do we want to define fairness? I think we need to explore that further. We've got about five more minutes — let's go. [Audience comment, partially inaudible: in China, the AI may be absolutely, perfectly accurate at identifying people and still be a problem, because the government is locking them up in concentration camps around the country — so it has nothing to do with accuracy.] Right — the China example may be completely accurate, and that raises another point, which is the difference between accuracy and ethics. And I'm not an ethicist — don't let me dive off the stage on this — but this year the EU issued its Ethics Guidelines for Trustworthy AI, and that's a huge, expansive piece. So when you talk about — you know, I know you've got a narrow little
topic, and that's your focus, that's fine — I just wanted to know your thoughts on this much broader, you know, individual and organizational approach that we see. Well, I saw one from Deloitte that I thought was pretty good, actually. I'm not familiar — maybe that's where you work, I don't know, okay. So — and I'm not an ethicist, I'm a machine learning expert; some people think I'm an expert — I saw guidance from Deloitte in terms of AI ethics that I thought was very practical and very workable, and also correct, and I do personally wish that — yeah, well, IBM has really taken the lead on this too, I should say. So I think there's a lot going on, and that's great. I think potentially it would mean more if it came from the federal government, but I don't know; maybe it wouldn't, maybe that's not necessary. And I do want to clarify something I said earlier: you know, Google, who sometimes gets disparaged in this way, and sometimes rightly so, also sponsors a lot of fairness research and sponsors these kinds of fairness guidelines and such. So it's really hard to say, except for maybe the exception of one or two companies, that everything this company does is bad. I think a lot of good things are happening in Silicon Valley, and I just wanted to make that clear — I think I was kind of unfair about that earlier. But in terms of those kinds of guidance, I've seen some good ones too, and I don't know if it's better that it come from a nonprofit organization, a university, or government; I'm not sure. Any comment? Just that, you know, government isn't one monolithic thing — that's right — just like Silicon Valley isn't one monolithic thing. So Congress can enact statutes that direct agencies to implement regulations; regulators need to have the legal authority to engage in
regulation and rulemaking, so they can't just go out and do whatever the hell they want to do — there has to be a basis, and oftentimes there needs to be a rationale. So I think there's a big coordination problem that happens within government. We look to government for guidance and for leadership in that way, and it's probably also the reason why it's only in times of crisis that we see that kind of mobilization occurring. I guess one more comment I would make — and maybe I'm not as familiar as I should be with the EU guidance — but a lot of the guidance I've seen falls short where Nick and I actually work, which is: how do I literally implement, how do I code, these methods that make the model transparent or fair? I see these guidelines, in broad strokes, being very effective and very helpful, but I think maybe why Nick and I don't pay as much attention — I'm not going to speak for Nick — why I don't pay as much attention to that is because in my work, when it actually comes time to write the code that makes the model transparent, that guidance seems to be less useful. Just my two cents. Do you want to comment? I think what you're saying is incredibly important, and ultimately it comes down to a question not of all the stuff we've been talking about, but really a question of how will these models be used, and potentially how can these models be used. Bryce and I were talking to a woman a couple of weeks ago who had developed — I won't go into the details — a system, and she thought it was going to be used in a positive way, and it ended up not being, and she really had a sense of deep regret about it. And I think that's very important to understand, very important to think about: how are these things going to be used? And unfortunately, I don't know that we're in a situation where we can keep bad people from using
what could be good stuff, but I think we need to try. [Audience comment, partially inaudible: are the robots coming for us?] I've been impressed by the level of social awareness — the increase in social awareness — that's occurred in the past, say, two years. What's that? Sure. So two years ago I was really, really scared about this; now I feel like if you read newspapers — which may be a very small portion of the population — you would at least be aware, like, hey, companies are doing some kind of sketchy stuff with AI, and this is probably something we should be aware of. But maybe it's the circle I run in or whatever. I do agree that it's a huge danger, and I'll say it again: if you don't know what your AI is doing, then who does know, and what is it doing? Those are very scary questions. All right, we're at time, but I'm happy to keep answering questions for five or ten more minutes — do you need to run? Okay. [Audience question, partially inaudible: what breakthroughs are to be expected, or needed?] From my perspective — and I think maybe Nick will agree — again, two years ago I would have sat here scratching my head and been really stressed out by this question, because it was my job to solve this problem and I was struggling. But I'd say in the last two years there have been massive, massive breakthroughs in the tools and math that we can use to understand AI itself, and so now I think it's much more a question of people learning about it, people using it — it essentially trickling down into the wider economy — and education. I'd say maybe that's what needs to happen: some of these newer techniques need to filter down into the education pipeline, which I don't think
they have, for the most part. The thing that I would love to see the most work on is causality. There's a guy named Judea Pearl who's done a lot of work on causality and has really moved beyond, you know, "correlation is not causation" — that may very well be true, but it's important to test it. And ultimately in AI, where we have these very complex models, if we can get down to models that are really causal, or we can get the variables to the point where we're really only using the causal part of the variables to do predictions, then we are going to be in much better shape. That's going to influence fairness, it's going to decrease error rates, and it's going to decrease discrimination. So I think that could be really — you know, it's a good point. And just to give a 30,000-foot summary of what Nick just said: mostly we work off correlation right now. A really smart person made this point at ICLR, the conference I was just at: when a model says you have a 0.3 probability of default, shockingly rarely does anyone go back and check with historical data that people like this actually defaulted 30 percent of the time. We just take the number that comes out of the model and say, well, 30 percent, right? And so, just trying to think of a high-level way to explain causality: what we want to move from is correlation — you know, this happens, and oftentimes this other thing happens — to a deeper level of understanding of why, and of what causes things. So that's a really good point. One last question — you've been patient. So, what are the characteristics that would make a model explainable? And is there just one characteristic, or are there many characteristics and many different kinds of models that would be explainable? So — do you remember, in the Shapley paper, the three things that SHAP is good at? Those would answer the question. Yeah, and I'll
jump in — I think that's a good place to start. No, I was asking you what the three were — additivity? missingness? oh God. All right, so there's a paper on SHAP that I think does a good job, but one of the things about explanations is that even the definition of an explanation is really complicated. There was a paper that came out, I think at the FAT/ML conference, that was like ten pages of definitions of explanations, starting with Aristotle or Socrates and going forward, and so I don't think there's likely to be a single thing that comes out that's going to satisfy everybody. All right, I'm just going to read off my phone so that I look maybe less dumb. The three theorems: local accuracy — okay, so that means that when I say, you know, your credit score was the main reason why we didn't give you the loan, your credit score actually was the most important thing and not some other thing. So local accuracy being your explanation being specifically accurate for each person; missingness, meaning that if you didn't have a credit score, for whatever reason, then credit score wouldn't show up in your explanations — so, all right, I mixed it up: the first thing I said is actually consistency. So: consistency, where the thing that we say is the most important is not actually less important than something else; missingness, the idea that if I don't have a credit score, it doesn't show up in the explanation; and local accuracy, which means that the numeric values of all the explanations sum up to the model prediction — so I say this contributed 30 percent, this contributed 40 percent, and this other thing contributed 30 percent, and it sums up to 100 percent. I think those are the things. I just want to add one thought: there's been a lot of focus on the technical solution, and I think it's important, but I think there also needs to be some reflection on the context,
right? So, I'm giving this information to you: who am I giving the information to, and how do I present it to them in a way that's actually meaningful? I can extract some information out of the model, but then how do I actually package that and share it with whomever? And then what's the context — what are the governing laws, policies, the principles that you're trying to achieve with that information sharing? And I'll add very quickly: people traditionally think linear models are interpretable, which I would agree with; decision trees people often find interpretable; and rule-based models people often find interpretable — and you can argue that a decision tree is a rule-based model. So I'd say there are different kinds of interpretable models, and there are a lot of things that make things explainable and interpretable — and people argue about whether those are the same thing. Going to Bryce's point, I think that we really need to think about what it means to be explainable, and things like: what do you want to know? Are there characteristics that are immutable? If you're getting rejected because of your education, and education is very difficult to change, do you want to know that? Maybe you do, maybe you don't — maybe you're only interested in the things that you actually can change. And then even among those there's a question: say we're looking at loan acceptance — is it the thing that is driving you furthest from acceptance, or is it the thing that is easiest to change? Those are two very different things, and how do you measure "easiest to change"? So there's a whole host of things that come up as second-order problems as a result of just trying to have an explainable model. Okay, so his question was on the veracity of models, and I'll give my little spiel on this: explanations are not actually about trust. They're
related to trust, but they're not actually about trust, because I can explain something to you and then you can decide you don't trust it, right? That happens often. So there's a whole new field called model debugging — though it goes back to diagnostics of statistical models and linear models — so there's a whole different tool set for deciding if models are trustworthy, and explanations certainly play into that; they're overlapping concepts. But if your sole goal is trust and you are using explanations, you can easily fall short. I was just at a conference where there was an entire day on just model debugging, and to me there are all these new error metrics — slicing and dicing the error a million different ways — and basically trying to bring some of the best practices from software development, like unit testing, into machine learning. So I'd say that you're right to bring up this question about trust, and that explanations are not the best tool for trust; actually, they're important for appeal — that's what they're important for. If I told you I stole your car in order to use it as a getaway car from a bank robbery, you probably wouldn't trust me, even with a good explanation. So I think that explanations are probably necessary for some people, but they're definitely not going to be sufficient. Okay, I'm going to cut it off there, so thank you guys so much for hanging around on this nice spring evening and listening to a talk about regulation and AI — a round of applause for our panelists. [Applause]
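The COMPAS tension the panel describes — calibration versus equal false positive rates — can be reproduced on purely synthetic data. This is an illustrative sketch, not the real COMPAS data: outcomes are drawn from the scores themselves, so the model is perfectly calibrated by construction, yet the group with the higher underlying base rate still shows a much higher false positive rate, exactly the algebraic conflict described above.

```python
import random

random.seed(42)

def simulate(score_low, score_high, n=50000):
    """Draw calibrated (score, outcome) pairs: outcome ~ Bernoulli(score)."""
    data = []
    for _ in range(n):
        s = random.uniform(score_low, score_high)
        y = 1 if random.random() < s else 0
        data.append((s, y))
    return data

# Two synthetic groups whose underlying rates of occurrence differ.
group_a = simulate(0.0, 0.6)   # lower base rate
group_b = simulate(0.2, 0.8)   # higher base rate

def false_positive_rate(data, threshold=0.5):
    """Share of true negatives nonetheless flagged as high risk."""
    negatives = [s for s, y in data if y == 0]
    return sum(1 for s in negatives if s > threshold) / len(negatives)

def observed_rate(data, lo=0.4, hi=0.5):
    """Actual outcome rate among people scored in [lo, hi)."""
    bucket = [y for s, y in data if lo <= s < hi]
    return sum(bucket) / len(bucket)

# Calibration holds for both groups: people scored around 45% actually
# reoffend at close to that rate, in each group separately.
print(round(observed_rate(group_a), 2), round(observed_rate(group_b), 2))

# Yet the false positive rates diverge, purely because base rates differ.
print(round(false_positive_rate(group_a), 2), round(false_positive_rate(group_b), 2))
```

No threshold choice fixes this: with different base rates, a calibrated score cannot also equalize false positive rates across groups, which is the impossibility result the panel alludes to.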
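The "local accuracy" property discussed above — per-person attributions summing to the model's prediction — can be checked exactly for a tiny model by computing Shapley values by brute force over feature coalitions. The model, feature names, and values here are invented for illustration; production SHAP implementations approximate this far more efficiently.

```python
from itertools import combinations
from math import factorial

# Toy "model": a linear scoring function over three hypothetical features.
def model(x):
    return 0.5 * x["credit_score"] + 0.3 * x["income"] + 0.2 * x["debt"]

background = {"credit_score": 0.0, "income": 0.0, "debt": 0.0}  # baseline
applicant  = {"credit_score": 1.0, "income": 0.5, "debt": -1.0}

def value(subset):
    """Model output with the features in `subset` at the applicant's
    values and the rest held at the background values."""
    x = dict(background)
    for f in subset:
        x[f] = applicant[f]
    return model(x)

def shapley(feature, features):
    """Exact Shapley value: weighted marginal contribution over coalitions."""
    others = [f for f in features if f != feature]
    n = len(features)
    phi = 0.0
    for k in range(len(others) + 1):
        for subset in combinations(others, k):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            phi += weight * (value(set(subset) | {feature}) - value(set(subset)))
    return phi

features = list(applicant)
phis = {f: shapley(f, features) for f in features}

# Local accuracy: attributions sum to prediction minus baseline prediction.
print(sum(phis.values()), model(applicant) - model(background))
```

For a linear model each attribution collapses to weight times feature deviation, so the sum matches the prediction exactly; the same additivity is what lets an adverse-action notice say "this factor contributed this much" and have the pieces reconcile.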

The Open Government Data Revolution



So, I see what I want to say dovetails very well into the previous talk: many of the examples that were being given to you about effective modern governance at the city level are going to end up drawing on this foundation — this foundation of open data that we've been engaged in for a while now around the world. The UK was one of the initial leaders in this work, and is still trying to push the envelope, as I'll try and describe. I'm from a university background, in that I head up a group at Southampton on web and Internet science, but I'm also an open data adviser to the government, and I actually helped set up the original data.gov.uk portal with Tim Berners-Lee back in 2009 — I'll talk about that a little bit. Just a few things to say first. The keynote this morning talked about data being the new oil, and people think there is a superabundance of data — and indeed there is — but the extraordinary thing about the superabundance of data is that in itself it's extraordinarily powerful. People think it's an unalloyed problem, but if you get the right data organized at scale, it takes on remarkable properties. One of my favorite examples is this one from Google's research, where they took the query logs from a very large number of American users and were looking to predict, from the search terms being looked at, the outbreak of seasonal flu — an epidemic of flu, essentially, in the United States. It takes about two weeks, using traditional methods, to get physicians' data back to the Centers for Disease Control to actually plot the actual data trend — you can see that CDC data, this orange plot here. They were able to build a model, essentially a knowledge-based model, of what terms were being used, to precisely match that outbreak, and of course they were doing it in the end in real time: they could show real-time tracking of the flu outbreak. And of course it's because people collectively are going to be searching, at
times of flu outbreaks, as they're breaking out in the community, for particular sets of key terms and objects of interest to them, and such like. That seems to me an extremely powerful indicator of how something as fundamental as public health policy, or well-being in a community, can be driven by this data. Of course, the realistic question is: whose data is this, and just how easy is it for anybody other than a very large search engine company to do this, and what would the terms and conditions be under which that data could be released back? I love these examples. And my other favorite example — this is, you can just about make out, what appears to be a light pollution map of the UK — of Europe, actually. Each of those luminosities is a geocode from a Flickr photo upload; each of those points of brightness is a Flickr upload. And in fact, if we look at that at higher resolution, what do you see? It's a map of London. There are the major bridges crossing the River Thames, you can see the major thoroughfares, you can see these densities here; every one of those points is a Flickr geocoded photograph. And of course, obligingly, the people who take those photographs have been busy tagging those photographs as well, so when I get this freely available open data from Flickr — which I can do, and download it; one of Jon Kleinberg's students did it — it comes already marked up with the most frequently photographed and labeled tourist destinations in the city. Now, that level of immediate intelligence, rendered off very low-level data, is a world that I think has huge possibilities and opportunities for us — and I haven't touched government data per se. It's interesting to think, though, of the range of datasets that can become available for us to use and exploit. My other example I often use is this well-known one — many people have still not heard it, it's still news to them. This is a map, from an open-source product called OpenStreetMap, of Port-au-Prince,
the Haitian capital. Before the earthquake there was no map of that capital city — bad news when your capital city has been destroyed and you've got to work out where to put relief. They actually crowdsourced the construction of an incredibly high-resolution map. Here it is — in 12 days. Twelve days, because people were on the ground with GPS receivers and laptops, uploading those data coordinates to an open source platform, with open data formats and open licenses. And when you see that happen, you realize that we can truly crowdsource, in the same way we were hearing earlier, remarkable intelligence around cities and the environments we live in. So the power of open, I believe, is very profound indeed, and the exciting thing is that we're applying it now to government data itself. This was the state of affairs in about November 2009, when Tim Berners-Lee and I were asked by the Prime Minister to start opening up government data in the UK. We produced what we called the postcode paper — here's a postcode — published at the Guardian newspaper headquarters: we took all sorts of local data, public data, nationally and locally generated government data, and made it into a newspaper with respect to that postcode. The problem was that 80% of the content of that newspaper was illegally reproduced. Illegally reproduced — even the postcodes we weren't allowed to use in that form; we'd have had to pay the Ordnance Survey for the privilege of recording and using that piece of information. So there was a lot to shift. But the dial has been turned, and in fact in three months we had our first portal, data.gov.uk, up and running. It ran on open source software, and it did something rather heretical in terms of government IT: it was a beta site in constant development. And in just 24 months we had this site here, data.gov.uk, where you put in your postcode and can actually access the data sets available for that particular region — a postcode is about 8 to 12 residential
addresses — and you can now find out data about the crimes occurring in that area, the educational attainment rates, where the bus stops are, a whole bunch of stuff, okay? And that has been happening because we've had a real sea change in the whole approach to data release and publication. We had some friendly competition along the way: the US in particular began this work — the Obama administration released, in 2009, the first executive order on openness and set up data.gov — and we followed suit a little later in 2009. We now have over 8,000 data sets available on data.gov.uk. Of course, the granularity with which you count data sets is an object of much friendly competition: we count entire maps of the UK as one data set, and if you parcel them up, you can get a very good score on the data count. Much to say about that. The interesting thing about our support for open data in government is that it's been led from the top, from the middle out — civil servants who are engaged in this — and from activists. The top-level political support we've had has been really important. Neelie Kroes here, the Vice President of the European Commission, was actually extolling the virtues of European open data just a few months ago; I'd be very interested to see how much we actually, materially, get released, because despite all of this goodness there are challenges around open government data that I want to come on and address. The reasons for doing this: it's a powerful idea, whether it's mapping a capital city that has no detailed maps, or finding out what the state of public health is, or looking at snowfalls and working out which streetlights to fix — there are many examples. This is a photograph of cholera bacteria: famously, when a particular physician in the 19th century mapped death rates on a map of London, they discovered that the people dying from cholera were all clustering around a particular water well — you know, they didn't know that cholera was a
waterborne disease at that point. It changed the whole perception of public health. Similarly, this is a picture of MRSA — the hospital-acquired infection that does for a good number of people up and down the country; certainly used to do for a lot more. And then we started to publish infection rates and death rates in hospitals as a league table, and of course that data led to a rather dramatic change in behavior at those hospitals, and it was one of the major instruments that led to a sharp decline in hospital-acquired infections — that, and deep cleaning and other actual policy actions — because people were seeing very clearly the effect and impact of this sort of information. We talk about transparency and accountability, improving public service delivery, improved efficiency — these are all reasons why you would want to release open government data, and we again heard them in the previous talk around engagement, citizen engagement. But we also get data improvement: government's data is no better than many corporate datasets. When we finally got the UK to publish bus stop data — where the bus stops in the UK are, 360,000 bus stop positions — 17,000 of them weren't where the government thought they were, which is tedious if you're trying to build an app or turn up for a bus. Very soon after that was published, a crowdsourced site was developed where people could enter the actual positions. So now a challenge for government is how it does open government 2.0: how do you write back data in a way that it becomes, in a sense, official data — data that has a provenance that is backed both by the crowd and by government? But we're also seeing this in terms of economic value and societal value, and I'll come on to that in a moment. So, open the data, and people's experience is that the applications do flow. Whether they're flowing fast enough, or whether they're making the difference, is of course a question
we're now asking ourselves, two years into the experiment. Are these good, sustainable citizen engagement tools — are these tools helping us manage our understanding of how a city is functioning, or a nation state, or a region? How do we drive both demand for the data and utilization of the data, and build the ecosystem around open data? And maybe we're going to discover that data, like everything else in this new economy, has a long tail: some datasets are highly reused by very, very large numbers of apps and people, and some data is of interest to a very small constituency. But remember, the lesson of the long tail is that an awful lot of utility and use lives out along the tail of the distribution, okay? So just seeing that your data set is the most used does not mean that substantial amounts of other data don't have utility. And in fact the assumption we have in doing this work is: presume to publish. Make publishing the default, and then unanticipated reuse makes much of the rest of the magic that we observe on the web a fact for open data. So we get data at all scales — it's not just nation states, it's regions and cities that are releasing — and here we've got examples of Redbridge, a borough council in London, and London's data store itself. All good stuff. And increasing numbers of countries: Singapore — I just returned from Singapore this last weekend, looking at their data — Kenya, Chile, the English-speaking democracies; a whole range of open data efforts now growing up. And we've achieved a lot. We can say, I think, that we have seen significant data sets released. The licenses are essential to this: certainly one of the lessons from the UK data release is that you've got to allow your licenses to be unrestrictive, not surrounded by minor terms and conditions. I go and see lots of data sites that claim to be open, and somewhere in the background of a particular chunk of data there's a little restrictive covenant — you
shouldn't use it to do this, or you shan't use it to do that, or we can use it but you can't use it in a commercial reuse context. Open is open. Look, we've seen developer communities grow up, and we've seen a degree of international collaboration start to emerge — all good things. And there's something particularly compelling about the city, the urban conurbation, as a data user. A lot of the cities get open data and have been some of the earliest advocates and exponents; many of your best apps are urban. Rather irritatingly, the apps are good, but then they kind of run out when you pass the city limit. We've got some great examples in transportation in the UK which work great in London, because the Mayor and TfL had the mandate and authority to get the data out there — but cross the city line, and your bus finder, best-route finder app falls to pieces, okay? So urban conurbations have a kind of coherence, such that if you're there, it's still good news for you, because they have authority over their data, and there's a network effect. All data sets have a network effect, but as we saw again in the previous talk, around transportation, utilities, education, public service provision, data sets tend to supplement and support one another. If you're trying to work out where you want to live, to buy a house, you'd like to know about the crime rates, how effective transportation is, where the actual schools are and how they are doing. People can make decisions, both at the governance level and in terms of an individual citizen's choice, because of the interconnected nature of much city data. And it always comes down to location, location, location. So it turns out that open geospatial data is a linchpin, and whenever we think we've got enough data openly available, there's some other data set that people want. The one ongoing ruckus at the moment in the UK is over a comprehensive address file that would give you the actual register — not the
people who live at the addresses but the addresses of all the businesses and all the people who would be be visited or submit a census form for example a there has never been a comprehensive list and be currently the proposals are that you can charge for this will charge for it but the amount of location-based specific services that will be empowered by a release of comprehensive addressing data I believe would be would be very large indeed so although we've achieved a lot in the UK there's always more to go for in my opinion and these are some of the products that the open that the Ordnance Survey now support for mapping and good they are I mean we're in it we're in a very much better place than we were just a couple of years ago and these get routinely used in a range of open data applications so looking at London again here we have a rather good illustration using the open OS open data mapping product and what we can see here is essentially thicker lines are more journeys by by hired bicycles and the red blotches you can see and you can see this inspect this on the high-resolution our pollution levels measured by LED emissions okay now this has been put together by a team at a spatial analytics Research Unit at UCL they're doing this on a weekly basis people looking at new kinds of information mashup that will have a direct interest to you the rider of a bicycle in London or you the public health consultant this is a similar example and both of these very recently made available just in January this year this is a map of what is called multiple indexes of deprivation basically an indication of how wealthy or affluent or not so affluent a region or an area is now Charles booth in the 19th century actually built a wonderful map of urban deprivation in London literally visiting every household doing a survey much of the same insights can be derived from data that is now held and now is openly openly published these become important policy tools important planning 
tools important tools to mobilize a community and this is this is data on London's daytime population the remarkable thing about this just taken from the London data stored undertaken by a researcher in Sheffield the daytime density of the City of London is 350,000 people per square kilometre okay extraordinary about 11,000 people live in that area or registered as living there but you start to see these peaks and flows these ebbs and flows this kind of sense of what you can learn from the statistics made available and in a way the one that is perhaps most about most most impressive because this holds politicians feet to the fire we've had spending data published in the UK at a very low level of detail 500 pounds every month in excess of published by every regional authority 360 of them gives you an exquisite picture of what's being paid for by local authorities what's being spent whether one authorities paying more for its fleet higher than another for example but this is giving you by crime types by street level every month reported crimes in the UK for England and Wales for England and Wales those are the Constabulary –zz that have signed up to this you type in a Scottish postcode you get nothing back ok which is an interesting issue I think if you're a Scottish citizen because what does this tell you well it tells you this is actually an application my group built in Southampton this is a heat map essentially I've taken the excel this is the H EE 16 one a a which is the postcode folk for the Excel Centre and I'm visualizing here lohi it's a heat map reported antisocial behavior okay this is a months worth of data this was the first set of data published in December 2010 and you have a bit of a filmic experience and I'm just going to scroll through that's December January February March April can you see a certain constancy in the location of anti-social behavior and where it's occurring and what that might do to your sense of what you police or pay attention to 
if you're a resident who you complain to or who you try and get a sense of what's happening here why antisocial behavior is this if you actually knew exactly when that was reported you would find because we know this exists exquisite temporal periodicity Friday night's particular time you'll see a bunch of so antisocial behavior in a bunch of particular places that are associated with of course checkout times at the local pubs or entry times at local nightclubs some of this stuff isn't so surprising but as a tool and I could have visualized burglaries or vehicle crime shoplifting this is a tool for empowerment but it's also a tool suppose you're an insurance company what are you going to make of this data what we would start to think about of course if you're then a campaigning group for for the digitally disenfranchised you start to ask yourself but you know if we start to have postcode insurance premiums how do those who don't have a voice get their voice heard but the data tells you the story very clearly and other countries are doing this of the states Singapore as I say don't have everything solved you'll be glad to know in the in the world's intelligence city an awful lot of their data is not available on the unrestricted licenses and a lot of it is simply national statistics and if I look at their crime data all I can find are sets of numbers by month across a huge swathe of the cities so I'm no real insight as to what's happening there and he did there's great stuff in the US but there's no federal equivalent there's no coverage across across the country so complex this is it's good we've been trying to work in the UK to improve the quality of data not just in terms of what's there but how it's linked together data can be linked and place and space are good places to do it so our geography allows us to if you can represent the data in the latest open formats used on the web for linking data we can begin to link other datasets together much more easily this 
is the ambition of so called linked data approaches on the way talk more about that perhaps in the session but it does allow us to produce now visualizations this is a particular post code in Southampton and we're just looking here at the post code this is the immediately surrounding post goes to the north and south and east and west and then the sets around those concentric post codes we can look at crimes crime types we can look at transportation access points we can look at educational attainment absenteeism from schools we can begin to do a range of information integration that was only ever available if it was asked for by policy makers within our statistics officers now we can argue about whether we're making the right interpretations around this we need new cadre of data literacy but it allows us to have the discussion and it allows us to powerfully think about how we might exploit it so this does amount to a gray revolution and in the UK the process is continuing we have significant datasets in transport and weather believe it or not in the UK you didn't have open access to weather data the predictions four days five days out every three hours the Met Office publishers for 5000 points in the UK three early predictive weather now you can look it up on it on their website as a picture but you couldn't get the raw data if I get the raw data I can build services around secondary insurance for rain insurance or events planning as a million things I could do that don't require me now just to go through one the third point of access the Met Office we're going to see rather dramatically releases of health data everything from what GPS are prescribing every month that drugs they're prescribing through to the outcomes they're detecting how does that vary by postcode how does that vary by maternal but by multiple deprivation indices this is powerful stuff and most recently particular for Tim berners-lee myself in our role in this work we've had announced an open data 
Institute to be funded based in Shoreditch not just very far from here at all to look at the commercial potential and exploitation of open data so how can we take those micro businesses those startups and and help build businesses based around these kinds of data releases and drive more data out of government how can we use the experience we have to get public services in the public sector to deliver its data more effectively for for reuse and how can we educate and help developers in corporations large and small to live in this open data environment that's so fast evolving and again speaking back to the original keynote this morning transparency and data open data to possible important components for capitalism to point naught and I think it really is an interesting challenge to ourselves is that the case thank you very much you
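The monthly antisocial-behaviour heat map the talk describes amounts to a simple aggregation: take street-level crime reports (month, crime type, approximate coordinates) and count them into coarse grid cells, which a map layer then shades. A minimal sketch of that aggregation, using made-up rows loosely modelled on the published street-level crime files (the column layout and values here are assumptions for illustration, not the real feed):

```python
from collections import Counter

# Hypothetical rows modelled on street-level crime releases:
# (month, crime type, approximate lat/long of the report).
reports = [
    ("2010-12", "anti-social-behaviour", 51.508, 0.029),
    ("2010-12", "anti-social-behaviour", 51.508, 0.029),
    ("2011-01", "anti-social-behaviour", 51.508, 0.029),
    ("2011-01", "burglary", 51.502, 0.021),
]

def heat_cells(rows, crime_type, month, cell=0.005):
    """Bucket reports of one crime type in one month into grid cells.

    The per-cell counts are what a heat-map layer would shade:
    the higher the count, the hotter the cell.
    """
    counts = Counter()
    for m, kind, lat, lon in rows:
        if m == month and kind == crime_type:
            # Snap the coordinate to a coarse grid cell.
            key = (round(lat / cell) * cell, round(lon / cell) * cell)
            counts[key] += 1
    return counts

december = heat_cells(reports, "anti-social-behaviour", "2010-12")
```

Running the same aggregation for each month in turn is what produces the "filmic" December-to-April sequence: the hot cells barely move, which is the constancy the speaker points to.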

Precision Public Health Summit: Leaders Voice Hope for Change



Thank you all for joining us at the Precision Public Health Summit here at UCSF. We hope that these experiences will inspire you to the possibilities of how we can partner to ensure that all children, no matter what their circumstances, have the best opportunity to survive and thrive.

The Precision Medicine Initiative is one of the benchmark efforts of this administration to unlock the power of data to create new scientific discoveries. Being able to focus that on population health and prevention is one of our key goals, and it makes sense to start with the first three years of life; that's the most important and vulnerable period.

Black babies die at more than twice the rate of the general population in the first year of life. They die because they are born too soon and too small.

At Bloom, what we're aiming to achieve is essentially designing the future of prenatal care with technology, to improve the health of moms and babies. What we do is combine wearable devices with data analytics, to both reassure moms and provide doctors with better information to improve birth outcomes.

I think we're at this very unique intersection of data and technology, and naturally the question is how we are using that to think about our own individual lots. We should fundamentally believe, as a nation, that a technology is neither radical nor revolutionary unless it benefits every single American. We're very good at building really unique, creative technologies; we have to make sure they benefit everybody simultaneously, to really provide the value proposition that we have to have going forward into the next great generation.

We actually believe the same types of innovations can really make a difference in public health, in population health, and that's what we wanted to do with the summit: bring together thought leaders from various sectors to explore how these innovative ideas can be applied to address public health challenges.

What we do know is that we can be exposed to environmental agents, like toxic chemicals, and that they can have profound and important influences on our health. My family was suffering from different health issues: we were losing our hair, we had these rashes, my one son has a compromised immune system, he wasn't gaining weight, and we weren't putting it all together at first, until our water started coming through our tap brown.

You have to fight these agencies you're paying to protect us, and there are reprisals; it adversely affects your career. But it is either that, or sit by and let bad public health, bad science and engineering, be used to poison little kids. Would more data have been helpful? Yes: to realize that there was a problem there, instead of trying to hide it.

My hope is that this summit creates new ambassadors and leaders throughout the country and the world who can carry that message: not just of precision medicine, which is the ambition to bring all this great technology to improving health, but, much more importantly, a newer, more profound, I think, equity message that can directly impact human health and well-being, that includes everyone, no matter what their zip code, no matter what their geography: that, in fact, precision public health can improve health for all.

Making government better, through data and design | Cat Drew | TEDxWhitehall



Who here can remember being a teenager? All of you? Brilliant. In 2010 there was a London borough that was thinking quite a lot about teenagers. The teenagers in this borough were no longer coming to their community centres. The borough used the data on declining numbers to make a case to invest in new equipment: in computer games, in football goals, in table tennis tables. And still people weren't coming; teenagers weren't coming. A puzzle.

A few decades earlier, BT also had a puzzle. They had created this amazing new customer service, the first automated telephone directory service; you could get your number speedily. And yet again, no one was using it. Odd. These things fascinate me, because you've got data creating insight or speeding things up, and yet something's missing. Now, I'll come back to these at the end, and maybe you can think about what the answers are as I go through my talk.

As a civil servant and a designer, I've always been nerdily interested in both analytical stuff and much more creative stuff. When I was really little I used to go round to my friends' houses with my imaginary friend Jack, not to play, but to tidy people's rooms. And now, on my shelves at home, all my books are very neatly ordered, not alphabetically but by colour; I can see them: yellow, orange, red, purple, green and blue. Ordered, but beautiful. And at school I won the statistics prize for the crucial scientific discovery that blue Smarties follow a normal statistical distribution, and also that you can eat 20 packs of Smarties in your lunch hour.

I've now been a civil servant for 10 years. I've worked in the big departments of state, like the Home Office, the Cabinet Office and Number 10, in very traditional, very important policy-making roles. But all that time I've had this niggle, this hankering to do something a bit more creative. So at school I rebelled, if you can call it that: I did my art GCSE in my spare time, after school on Wednesdays. And at work, after eight years, I thought I'd had enough: I'm going to pack my bags and go off to Berlin and become an artist. Artist in the day, serving cocktails at night. But after two years of the poor artist's life, I thought I had to come back. I kind of missed the really amazing uses that government could put words and numbers to, things that can make society better. So I came back, but I didn't want to lose that creativity, and I didn't want to keep flip-flopping between analytical and creative stuff all the time; I wanted to combine both. So I was a policymaker, and I also studied graphic design, and then it became apparent to me that you can combine both.

Two women in particular really inspired me. Florence Nightingale presented data on diseases in the Crimean War and, for the first time, revealed that most of the deaths were actually preventable; that changed the course of nursing. Phyllis Pearsall walked 23,000 streets here in London and created what we now know and love as the A-to-Z. Both of these women were designers who used data for social good. And now I am so lucky to work in Policy Lab, where I get to do this stuff every single day. Policy Lab was set up to support departments in using digital, design and data techniques to make policies better. We take data science, which uses really powerful computer techniques and applies them to huge amounts of really complex data, and we combine that with ethnography, which takes human experiences, behaviours and emotions and really tries to understand why people do what they do. We bring these things together, combine them, and share them with a diverse range of people to come up with amazing new ideas to make government better.

Our first project was on policing in the 21st century: supporting victims of crime. There was one woman, let's call her Jane. Jane was a victim of antisocial behaviour, and she was told by the police to keep a diary. And so she kept her diary; she showed us where she kept it, on her bedside table, and how she filled it out, the last thing at night before she went to sleep. Can you imagine how hard that must have been for Jane, and how much better it would have been for Jane if she could have had something online where she could share this information with the police as soon as it happened, so the police could start solving her problems? Now, there were many, many more rich observations like this, and we shared those with a diverse group of people, from chief constables to police officers to neighbourhood watch members, and they used their human creativity to come up with a whole range of other ideas. They came up with ideas for young people to report crime using Minecraft, or for older people to be able to sit on their sofas in their living rooms and give evidence at court. Now, all of these things are a bit out there, but they gave us the creative spark so we could create online crime recording. But to take that from a small pilot in Surrey and Sussex and scale it across England and Wales, which is what's happening now, we needed data; we needed data analysts to help us make the case that this would save 3.7 million pounds per year and 180,000 officer hours.

Our second project was around health and work. In the UK you've got 2.5 million people on health-related benefits, and that costs us 15 billion pounds per year; but we know that the right work can be really good for people. In this project we combined data science and ethnography throughout. The data science showed us that people are more likely to go on health benefits if they've been in their job a really short amount of time, and the ethnography showed that it was the relationship between the line manager and their employee that was actually critical to whether someone stays or goes. The data science showed us that women with depression are much more likely than men with depression to stay in work, and this played out in the ethnography. Another woman, let's call her Vanessa: Vanessa had been battling depression for a long, long time. She had been too scared to go to her boss to do anything about it; she didn't feel she had anything to show him. And then she got breast cancer, and, can you imagine, she said to us that she was relieved that she had breast cancer, because then she could go to her boss and show him something physical. She got time off work and was able to deal with both illnesses successfully. So throughout that project we combined data science and ethnography; they were always talking to each other, sharing their hypotheses and confirming them, and we built up this really rich picture of exactly what was going on. Again we shared that with people, and we came up with lots of ideas for how to support people in managing their health conditions in work, which we're now testing across England and Wales.

Data and design therefore require a new type of policymaker. When I first started in the civil service I didn't know what a policy was, and I certainly didn't know how to make one. For those of you in the room who don't know, a policy is a government position on something, and it can range from anything very specific, like the amount of benefit that is paid to a 70-year-old lady who is also a carer, all the way through to whether or not we go to war. Now, at the time when I started, we were called generalists, and I thought we had to be masters of everything. I soon realized that that is not possible at all, and I remember someone saying to me: a good policymaker doesn't have all the information, but they do know where to go and get it. Great, I thought: I can get all of this information and I can come up with all the ideas. In a world of data and design, that's not true. Data and design can provide the information, but they can also come up with the ideas. So a better definition is that a policymaker doesn't have all the information, nor the skills, nor the techniques, nor the ideas, but they do know how to bring people together. They need to be able to work with data analysts to spot patterns in data, at the same time as working with ethnographers to really get underneath that data and explain why things are happening. They need to be able to work with data scientists to automate really clunky bureaucratic processes, but also to design those processes so they actually fit in with people's real lives. And they need to work with graphic designers so they can visualize, and make accessible, the very complex data that the civil service loves, and share it back out with the public, so we can all generate ideas together.

Now, not all of us are policymakers or designers or data scientists, but we can all use a data-and-design approach in our lives. You might be someone who loves Sudoku but can't draw a stickman, or you might be someone who spends their life in art galleries but can't add up to save their life. We're all using, all the time, our creative and our logical selves. Take renting or buying a house: you have to make a cost-benefit analysis and make sure you can afford it, but you also need huge amounts of creativity to turn your house into your home.

This is important, because data is our future. Right now we are generating 2.5 quintillion bytes of data every single day; that's 25 with 17 zeros after it. Every single time you go online and search, use your store card, or tap in with your Oyster card, you are creating data, and experts think that 20 years from now we're going to be creating a hundred times as much. Citymapper is an app which uses government data to tell you how to get from A to B. Great; there are lots of other apps that do that. What's brilliant about it is that it uses human stories and human needs to present that data in a way that we find useful. Rain Safe is a service which not only tells you how to get from A to B, but tells you the driest way to do so. If now we have apps that will help us get from A to B in the driest possible way, in the future we're going to have autonomous vehicles that can drive us there. If now we can use our Fitbits and our smartphones to tell us how many steps we're taking every single day, in the future we're going to have smart fridges that will monitor our health and order in healthy food for us. And if now we're just about starting to get elderly people to remotely share their blood pressure with their GPs from their homes, in the future they'll have remote robot companions to help them do that. So data is going to completely transform our lives in ways that we can't even imagine, but we have to make sure it is well designed. Data, after all, is human: we all generate it, we're the ones who give the data, mostly, and we're the ones who do something as a result of what the data tells us.

So let me take you back to those first two stories: the community centres and the telephones. A London borough was having all this trouble with teenagers not going to the community centres; the data showed numbers were declining, but no one could understand why. So they got some researchers to go out and actually spend time with these teenagers, to find out what they do like doing and what they don't like doing. And what did they find? Well, not surprisingly, girls for the most part don't like computer games, football goals and table tennis; and the boys, the boys actually prefer hanging out with the girls. So a very, very simple story, that boys mostly prefer hanging out with girls, explains the data. The community centre was able to invest in equipment for girls, and numbers went up. And BT, who had this amazing new speedy automated service for the public: no one was using it because they didn't trust it. They did not trust that a computer could look up a number so quickly. So someone had to have that aha moment of saying: we need to build in trust, we need to record the sound of someone flipping through a ginormous phone book, and we'll play that to them while they wait their small amount of time. People believed it; people started using it.

So let me now leave you with one final thought. Is data the new oil? If it is, we have to treat it with so much care. We have to make sure that we're using it in a way that humans would want, and design can help us do that. Like a hybrid car, we need data and design together, in combination, and we need hybrid policymakers to help us do that. Thank you.

Find, Use, & Govern Data with IBM InfoSphere Information Governance Catalog



Today we're going to look at how IBM Information Governance Catalog makes it simpler to find, use and govern data. Our marketing team needs help identifying smartphone buying trends. For this, we'll need to locate data sources for customer purchases and then compare them against other data from the supply chain and product launches.

Since we don't have the specific table names, we can use the catalog to find assets by searching on a relevant business term; in this case, we need customer data. The results from our search can include related terms, tables and reports. Since we're looking for customer data, this customer sales table looks promising. We can hover over the asset and get a quick overview of it. If we select the asset, we can see more details, like the business definition and structure, to get a better understanding of how it's being used in the organization. This information helps build trust that this data is what we need. If we have more questions about it, we can ask the data steward for additional context.

We can also explore the lineage of the asset to see where it's coming from, how it's been used in other places, and which other processes or applications have used it. Here, the data lineage shows that the customer sales table was derived after various transformations and filtering. Now that we've identified the data and are confident it's valid, we can add it to a collection. We can repeat this process, using other business terms like "sales" and "discount," to search and collect as many data assets as we need. As we use these assets or add new business terms, the governance catalog updates its records.

This is just one way organizations can gain value from the Information Governance Catalog. Visit the link below and download your free trial today.
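The workflow in the demo rests on two data structures: assets tagged with business terms (so you can search without knowing table names) and lineage links (so you can see what an asset was derived from). A toy, in-memory sketch of those two ideas, which is in no way the actual Information Governance Catalog API; all names and descriptions here are invented for illustration:

```python
# Invented mini-catalog: each asset carries business-term tags,
# a description, and "derived_from" links for lineage.
catalog = {
    "CUSTOMER_SALES": {
        "terms": {"customer", "sales"},
        "description": "Customer purchase records after filtering",
        "derived_from": ["RAW_POS_FEED"],
    },
    "RAW_POS_FEED": {
        "terms": {"sales"},
        "description": "Unfiltered point-of-sale feed",
        "derived_from": [],
    },
}

def find_assets(term):
    """Return the names of assets tagged with a business term."""
    return sorted(name for name, a in catalog.items() if term in a["terms"])

def lineage(name):
    """Walk the derived_from links back to the original sources."""
    chain = [name]
    for parent in catalog[name]["derived_from"]:
        chain.extend(lineage(parent))
    return chain
```

Searching on the term "customer" surfaces `CUSTOMER_SALES` without knowing the table name up front, and `lineage("CUSTOMER_SALES")` traces it back to its raw feed, which is the trust-building step the demo highlights.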

AI, Big Data, and Data Governance // Stan Christiaens, Collibra (FirstMark's Data Driven)



as Matt briefly introduces for a data governance software company and we have this niche audience in a way of chief data officers and data stewards and data Czarina's and and the like so we have the sort of adapt our message a little bit to the variety of audience which is on the one hand technical as well as business if I understood correctly right so I'll try to give you as as good as a story as I can or a multitude of stories and if there's questions at the end I believe you have like five minutes of questions so first let me talk to you about the frustration that I've seen with you know companies and when it comes to getting a value of data so if you you know if just like this guy here I forget your name I'm sorry but you know you're trying to find machine learning experts sorry and you know data scientists and all that stuff and you find that there's not a lot of good ones out there but then you let's say you find one or you find a team then you're gonna actually hit their frustrations which are many but there are two very important ones one of their first frustrations is I can't find the data right that's like the biggest problem where's the data give me the data I'll put the models on it then one they have the data then the next problem appears and then it's all you know they make all sorts of classifiers beautiful visualizations training data sample data what have you and then they produce beautiful output whatever they produce models classifiers but then the organization sort of does nothing with it right they don't make a product out of it they don't make a service out of it they don't change the business process so these talented people that you then hire are becoming demotivated and will actually go somewhere else just because their work is not actually adding any value to the business so I'll try to talk about some of these topics and I'm going to put that in the context of these seven predictions we did with calibra about a nine months ago I would say 
and I'll see if I can remember them and I'll tell you which ones were actually complete wrong so the first one is about the rise and fall of the CBO and chief data officer and I'm going to again play on you know the email story so you know we think from our viewpoint as a data governance software vendor that's achieved at officers on our own the rise which they are but the question is how temporary are they actually going to be and then I learned from one of our customers who did a start-up twenty thirty years ago when email wasn't around that back then they actually had messaging systems that they bought and sold and they had a chief email officer back then nobody has a chief email officer right now right so the CEO how long will it last it will still grow but how long will it last and second data will require a system of records right just like a chief financial officer as a CRM ERP or PP of sales as a CRM system data will have the same kind of meet three data education will explode and it has right I think there's another Belgian startup called data camp who has a million students opening data science so there's a lot of data learning out there ideally this translates all the way through not into the technicalities of how to make a Python script but actually into how does data get into an MBA program for example right for and the predictions was or was it again the data data citizens who are other people who use data to do their work in our company product managers or data citizens for example sales ops and data citizens they will rise up against the data dictator that's a chief data officer who tries too much to control the data and doesn't democratize enough for people to actually get value out of it so that will also happen what are we now five the Internet of Things will disrupt business models of course that's already happened we were too late with that one data protection will overcome data privacy especially with the European GDP our protection rule and 
then the last one I think we had was all about that the blockchain will emerge into seventy so the last one at least from our vantage point we got completely wrong because when we were saying the whole blockchain thing to our at our user event a couple of months back half of the audience you're not doing this right now right that half of the audience was actually googling what blockchain actually meant so for us we got that one completely wrong and what I wanted to I use this story as a context for what we actually missed because the one that we missed as I understand it is a very popular topic in this audience and it's very simple right the one we missed I would say because of my silly reasons it's artificial and that intelligence and machine learning so I'll tell you why we didn't put that on a map because I'm a little bit of a skeptic and you know we had the AI winter in the 70s and the 80s when the government funding dried up and then the first commercial applications failed and then again back another story from Belgium my home country I've been living in New York now for three years we had that whole natural language processing event that happened where he had flange language valley and that boomed and busted all right so there's a lot of cycles that already went through AI and from that few points I didn't believe that it would hype as much as it puts this year so that one we got wrong now why do I believe that it did hide this year or is exploding this year multiple reasons one the processor power if you if you've seen it for example Nvidia has increased its stock price by four times over the last 12 months because everything is GPU driven right now matrix operations to more data everybody knows that right there's more data out there to actually apply your algorithms on and three and this is a belief of me that could be wrong is that actually the big tech firms they are Amazon Facebook Google Microsoft and all the others they're using AI and machine 
learning as their next feature war: who will have the best platform to build on? So that's why we believe we should have put this on the map. Since we're a little bit of a skeptic, we would also like to give you three pitfalls to watch out for when it comes to AI. The first one: Harry Potter here. It's not a magic wand. Maybe this is the best picture we could have taken, because Harry Potter is not AI either, but this is about how we as engineers tend to look at technology as the thing that will solve all the world's problems. And I've got another story about that. When we started ten years ago, our professor told us that back in the day there was a spin-off at the university, and they were all gung-ho about object-oriented programming. That was going to be their company: object-oriented programming, that was the differentiator. So that's an engineer, a computer scientist, looking at technology as the differentiator, but how does it actually add value to the business? The funniest thing about this story is the name of that company. You know what they called it? SoftCore. Come on, right? So with respect to the magic wand aspect of AI at the moment, I would say: don't expect that Johnny Depp is going to show up and turn into some superintelligence controlling nanobots all over the place. You're going to find most business applications of AI currently in very specialized applications, for a very specialized business problem. Even self-driving cars are very specialized, right? You can't take that same algorithm or machinery you produced and have it ride a bicycle; it's going to have to learn all over again. The way you recognize faces is different from the way you recognize other things, for example. So AI will have business value first in specialized applications. Look for it in your business with that very clear requirement: how does it add value, how does it reduce cost, how does it reduce or mitigate risk? That's where you have to
look. And if you don't believe me, just think about the cost of doing AI. I believe it was Google, when they came out with AutoML, I think it's called: they did this experiment where they had one neural network learn what the configuration should be for a child neural network, and that one experiment took 800 GPUs and several weeks of calculation time, just to run one experiment. You don't want to invest that cost if you don't know what the value is going to be; your electricity bill is going to go through the roof, right? So that's one thing to watch out for. The second thing to watch out for with AI is the salesman's pitch, so it's all about doing your due diligence, and I'm going to use two examples here that have had a lot of attention in the media. I don't want to diss IBM the company, but I am going to use them as an example. You know how they did this big thing with Watson winning Jeopardy and so on. Around the same time, they also made this big announcement that they were going to solve cancer together with the MD Anderson Cancer Center. Now, several years later and 60 million dollars down the drain, that initiative failed. It actually failed: they stopped doing it and went back to market. Why did it fail? First, because they had challenges connecting Watson to the electronic health record system, which is pretty fundamental if you want to get some data going in there. And second, they had too little good data. It turns out that of all these papers out there in the field of oncology, there's actually just a very small subset based on carefully curated and controlled clinical trials that have the right data to actually feed into the algorithms. And then there's another story, a failure of machine learning that maybe you all know: the Google Flu Trends prediction. If you remember, a few years ago they predicted, based on search keywords in Google,
that the flu was going to break out, and they said they would do this better than the CDC, or faster, instantaneously. So they did that, and it worked, until it didn't. In 2013 they had a mismatch of 140 percent between the prediction and the actual situation in the world. And again, why was that? Because of all sorts of basic checks. Their model was overfit, and they didn't take into account that the data actually changed: Google changed its search suggestions in the meantime, which changed the data that was produced and then consumed by the algorithm. Again, basic things. So don't fall into the snake oil salesman's trap, and please do your due diligence on the technology. And then the last pitfall is the algorithms. Our belief is that with AI and machine learning, you will not win this war by having the better algorithm. The differentiator, the value proposition, is not in the algorithm; it's actually in the data. The algorithms will be open source. I don't know if you've seen data scientists or machine learning people in action, but it's typically sitting in Jupyter or Zeppelin, typing Python commands into these notebooks and immediately seeing a classifier or a visualization. It's pretty cool, right? But the algorithms themselves, the neural networks, they're open source. Google actually acquired Kaggle, which is all about open-sourcing the models on certain data science problems. So you're not going to differentiate yourself with the model; you're going to differentiate yourself with the proprietary training data that you actually feed into the models. Why do you think Google has been buying data acquisition companies for years, for a lot of money? Why do you think Google makes all these weird devices, like the backpack that scans the street, or the car that scans the street, or Nest, right, the 1984 spy cam that you put in your house, that sort of stuff? They do this so they can get proprietary data, and that is making all
the difference. So our view on AI is that data will differentiate how you will succeed with AI and machine learning. And then you come full circle to the beginning of my story, which was about that frustration: how can that data scientist or business analyst, or whatever you call that person, actually get to the data? Their questions typically go as follows. "Give me the data." By the way, that question just pisses me off. When a data scientist comes to me and says, I don't have data, then I tell them: go find it, or make it. It's not an excuse. It's part of your job, right? If you need to write a Python script or hack into the database of your own internal company, just do it; get the data. There's no excuse. Anyway: first they can't find the data. When they then can find it, they cannot understand the data; they don't know the business context, they don't know how to interpret it, which could be pretty basic, because bias is lying like a snake in the grass in the data, if you will. If they then can understand it, they don't know where it's coming from, the lineage, as people often say. If they understand where it's coming from, they don't know what's wrong with it. And if they don't know what's wrong with it, or they do know that everything is okay, they don't know who to actually call and ask about the data: they don't know the data owner, they don't know the data curator. All of these are problems about finding, understanding, and trusting data that are so commonplace, in our view, in AI projects. And not just AI projects, because the way we see it, this is all a people problem. This is all being done in multiple places, multiple data projects all over the map, and it's all people doing, ad hoc, what we call the digital equivalent of WD-40 and duct tape: Excel spreadsheets, meetings, et cetera. (One minute? All right.) And so people do this in AI, they do this in BI and analytics, all the same things, all the same steps. They do it
in big data and Internet of Things projects, data quality projects, GDPR and regulatory compliance projects, and so many more. So it's like this firefighting around data controls, all over the place, all the time. It's just disorganized; it has all the symptoms of a broken-down business process, and data today, if you treat it as a strategic asset, should have its own business process. So that's my last slide. I put some links up there for you to read all about this, and we have our university at the bottom if you want to do some free learning as well, if you can make the time.

Very nicely done, thank you. Actually, can you bring back the last slide? Yes? Okay, people can take pictures. Tell us a bit more about Collibra, the company. You started alluding to what you guys do at the end, but tell us about data governance, data cataloging, whatever the needs are, and what the product does. Well, thanks, Matt, I'd be happy to do that, although I don't know if the audience will be, because typically if I tell what Collibra does and I say that we're a data governance software company, everybody's eyes just sort of glaze over: governance, that's not interesting. So our job as a company, Matt, is actually to try and make governance sexy, or at least as sexy as you can make it, by focusing on both parts of governance. In our view, governance is about the control and enablement of any and all data management activities, right? Enablement as much as control. And in that context, the finding, understanding, and trusting, the whole collaboration around data, just knowing who to call about what data domain, all of these are aspects of governance. And that's what we've been doing since, as I was saying to Bob earlier, or sorry, to Mantis, or Sean, sorry, since 2008. It's almost a decade; I've been doing nothing but data governance and cataloging since 2008. Is there a difference between governance and a data catalog? Maybe explain that, in your view. Yes. So we've been quite fortunate, in the sense that we've been able to shape the data governance category, which is still in flux, and in our view the data governance category does include cataloging. My view on this, Matt, is very simple: if you have a data catalog, which is really a listing of all the data sets and attributes and dictionaries that are out there, without governance around it, it's just like a phone book. It doesn't do anything; it's not controlled; it doesn't work, or it's going to stop working. The reason I say that is because I've talked to all the high-tech companies, Twitter, LinkedIn, et cetera, and they all have their catalog projects; many of them have actually open-sourced catalog initiatives. But then you find that nobody's actually using them. There's no enforcement to use them, there's not enough enablement, and the first time you hit the catalog you get all those questions that I mentioned: okay, but then who's the owner I could call? What if I have a problem with the data? How do I get access to the data? Who approves my request? You get into governance questions right away. Last question from me, then we'll open it up to people. From a governance standpoint, is the right approach a data lake, where you centralize everything and then you put governance on top, or is it a more distributed approach, where you leave the data in its original repositories, but then I guess you need more agile governance software? I would say that whether you put all your data in a lake, or you put it in a warehouse, or you keep it separate in the applications, wherever you store it, that's more a function of the requirements and the data engineering that follows from them. For example, if you need a lot of scale and low latency, then maybe a lake is the best solution; if you don't, maybe something else is the best solution. So in that sense we are less worried about the data architecture; we're more worried about how you're going to coordinate all these people
that make or put in place the data architecture that satisfies your needs. What we do see a lot right now is indeed the centralized data lake approach, but again, I'm being skeptical; I think that's just the new thing to try, right? So many companies are sinking so much money into one of the big three distributions, and then, you know, two years down the line people are actually asking, what are we doing? But again, that's just me. We have two questions, and a microphone coming your way in about ten seconds. Thanks. I'm curious what your thoughts are on the ownership of data, with concerns about individuals' right to data privacy, and I think this is a bigger issue in Europe than it is here, where they have much more stringent requirements on the ability of companies to use individuals' personal data. Right, so the question was about data ownership, if I understood correctly, and my views on it. Yeah, so I'll try to break data ownership down into two views. One is what I'm hearing you say, where ownership really lies, as in Europe, with the individual. It's my data; those are my emails, Google, and if I want to move them to Hotmail, you've got to allow me to do that. That's a very European view, which Europe is actually, through regulation, trying to impose on any business doing business in Europe, so it also applies to the tech companies. But then, you know, I was in Silicon Valley talking to a Stanford guy a number of years ago, putting forth that European stance: no, the data is mine, right? And he was saying, Stan, you know, just forget about it. You've already lost. Whatever you think, the data is already in the hands of these technology companies. So that's definitely going to continue, and I'm hoping that the regulations will impose enough sanctions that these big internet giants actually allow more control over an individual's data, because that's currently the
complete lock-in, and that risks monopolizing markets. Anyway, if you talk about ownership from the company's view, like, I'm a company and we have all these databases, who is the owner of that? That's a very interesting topic, because nobody wants to take responsibility: no, no, I don't want to be the data owner. So there, what you have to do is sort of sneak ownership in under the door, so to speak: say to the business executive, the process owner or application owner, yeah, it's just your face next to this data domain, don't worry about it, until, you know, a year down the line they actually start adopting it. I'd be happy to go deeper, but those are the two angles on ownership. Thank you, and we're running a little bit low on time. You're going to be around after the talks? Yeah, at drinks. Okay, so people can ask you questions directly. Thank you so much, this was great. [Applause]
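A footnote to the Google Flu Trends point above, where an overfit model kept running while changes to Google's search suggestions quietly changed the incoming data: the "basic checks" the speaker has in mind can be as simple as comparing the distribution a model was trained on against what it sees in production. Below is a minimal sketch of one common such check, the Population Stability Index; it is an illustration on synthetic data, not something from the talk.

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline (training) sample and a
    new (production) sample. Common rule of thumb: < 0.1 stable,
    0.1 to 0.25 moderate shift, > 0.25 major shift worth investigating."""
    # Bin edges come from the baseline distribution's quantiles.
    edges = np.quantile(expected, np.linspace(0.0, 1.0, bins + 1))

    def frac(sample):
        # Clip so values outside the baseline range land in the edge bins.
        counts, _ = np.histogram(np.clip(sample, edges[0], edges[-1]), edges)
        # Floor the fractions to avoid log(0) and division by zero.
        return np.clip(counts / len(sample), 1e-6, None)

    e, a = frac(expected), frac(actual)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 10_000)    # data the model was fit on
stable = rng.normal(0.0, 1.0, 10_000)   # production data, same distribution
shifted = rng.normal(0.8, 1.0, 10_000)  # production data after an upstream change

print(f"stable:  {psi(train, stable):.3f}")   # small value, no action needed
print(f"shifted: {psi(train, shifted):.3f}")  # large value, retrain or investigate
```

Run per feature on a schedule, a check like this could have flagged the kind of upstream change the speaker describes well before the predictions drifted 140 percent from reality.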