Google Security Blog

Syndikovat obsah
The latest news and insights from Google on security and safety on the Internet.
Aktualizace: 52 min 25 sek zpět

Combating Potentially Harmful Applications with Machine Learning at Google: Datasets and Models

15 Listopad, 2018 - 22:08
Posted by Mo Yu, Android Security & Privacy Team

[Cross-posted from the Android Developers Blog]

In a previous blog post, we talked about using machine learning to combat Potentially Harmful Applications (PHAs). This blog post covers how Google uses machine learning techniques to detect and classify PHAs. We'll discuss the challenges in the PHA detection space, including the scale of data, the correct identification of PHA behaviors, and the evolution of PHA families. Next, we will introduce two of the datasets that make the training and implementation of machine learning models possible, such as app analysis data and Google Play data. Finally, we will present some of the approaches we use, including logistic regression and deep neural networks.

Using Machine Learning to Scale

Detecting PHAs is challenging and requires a lot of resources. Our security experts need to understand how apps interact with the system and the user, analyze complex signals to find PHA behavior, and evolve their tactics to stay ahead of PHA authors. Every day, Google Play Protect (GPP) analyzes over half a million apps, which makes a lot of new data for our security experts to process.

Leveraging machine learning helps us detect PHAs faster and at a larger scale. We can detect more PHAs just by adding additional computing resources. In many cases, machine learning can find PHA signals in the training data without human intervention. Sometimes, those signals are different than signals found by security experts. Machine learning can take better advantage of this data, and discover hidden relationships between signals more effectively.

There are two major parts of Google Play Protect's machine learning protections: the data and the machine learning models.

Data Sources

The quality and quantity of the data used to create a model are crucial to the success of the system. For the purpose of PHA detection and classification, our system mainly uses two anonymous data sources: data from analyzing apps and data from how users experience apps.

App Data

Google Play Protect analyzes every app that it can find on the internet. We created a dataset by decomposing each app's APK and extracting PHA signals with deep analysis. We execute various processes on each app to find particular features and behaviors that are relevant to the PHA categories in scope (for example, SMS fraud, phishing, privilege escalation). Static analysis examines the different resources inside an APK file while dynamic analysis checks the behavior of the app when it's actually running. These two approaches complement each other. For example, dynamic analysis requires the execution of the app regardless of how obfuscated its code is (obfuscation hinders static analysis), and static analysis can help detect cloaking attempts in the code that may in practice bypass dynamic analysis-based detection. In the end, this analysis produces information about the app's characteristics, which serve as a fundamental data source for machine learning algorithms.

Google Play Data

In addition to analyzing each app, we also try to understand how users perceive that app. User feedback (such as the number of installs, uninstalls, user ratings, and comments) collected from Google Play can help us identify problematic apps. Similarly, information about the developer (such as the certificates they use and their history of published apps) contribute valuable knowledge that can be used to identify PHAs. All these metrics are generated when developers submit a new app (or new version of an app) and by millions of Google Play users every day. This information helps us to understand the quality, behavior, and purpose of an app so that we can identify new PHA behaviors or identify similar apps.

In general, our data sources yield raw signals, which then need to be transformed into machine learning features for use by our algorithms. Some signals, such as the permissions that an app requests, have a clear semantic meaning and can be directly used. In other cases, we need to engineer our data to make new, more powerful features. For example, we can aggregate the ratings of all apps that a particular developer owns, so we can calculate a rating per developer and use it to validate future apps. We also employ several techniques to focus in on interesting data.To create compact representations for sparse data, we use embedding. To help streamline the data to make it more useful to models, we use feature selection. Depending on the target, feature selection helps us keep the most relevant signals and remove irrelevant ones.

By combining our different datasets and investing in feature engineering and feature selection, we improve the quality of the data that can be fed to various types of machine learning models.


Building a good machine learning model is like building a skyscraper: quality materials are important, but a great design is also essential. Like the materials in a skyscraper, good datasets and features are important to machine learning, but a great algorithm is essential to identify PHA behaviors effectively and efficiently.

We train models to identify PHAs that belong to a specific category, such as SMS-fraud or phishing. Such categories are quite broad and contain a large number of samples given the number of PHA families that fit the definition. Alternatively, we also have models focusing on a much smaller scale, such as a family, which is composed of a group of apps that are part of the same PHA campaign and that share similar source code and behaviors. On the one hand, having a single model to tackle an entire PHA category may be attractive in terms of simplicity but precision may be an issue as the model will have to generalize the behaviors of a large number of PHAs believed to have something in common. On the other hand, developing multiple PHA models may require additional engineering efforts, but may result in better precision at the cost of reduced scope.

We use a variety of modeling techniques to modify our machine learning approach, including supervised and unsupervised ones.

One supervised technique we use is logistic regression, which has been widely adopted in the industry. These models have a simple structure and can be trained quickly. Logistic regression models can be analyzed to understand the importance of the different PHA and app features they are built with, allowing us to improve our feature engineering process. After a few cycles of training, evaluation, and improvement, we can launch the best models in production and monitor their performance.

For more complex cases, we employ deep learning. Compared to logistic regression, deep learning is good at capturing complicated interactions between different features and extracting hidden patterns. The millions of apps in Google Play provide a rich dataset, which is advantageous to deep learning.

In addition to our targeted feature engineering efforts, we experiment with many aspects of deep neural networks. For example, a deep neural network can have multiple layers and each layer has several neurons to process signals. We can experiment with the number of layers and neurons per layer to change model behaviors.

We also adopt unsupervised machine learning methods. Many PHAs use similar abuse techniques and tricks, so they look almost identical to each other. An unsupervised approach helps define clusters of apps that look or behave similarly, which allows us to mitigate and identify PHAs more effectively. We can automate the process of categorizing that type of app if we are confident in the model or can request help from a human expert to validate what the model found.

PHAs are constantly evolving, so our models need constant updating and monitoring. In production, models are fed with data from recent apps, which help them stay relevant. However, new abuse techniques and behaviors need to be continuously detected and fed into our machine learning models to be able to catch new PHAs and stay on top of recent trends. This is a continuous cycle of model creation and updating that also requires tuning to ensure that the precision and coverage of the system as a whole matches our detection goals.

Looking forward

As part of Google's AI-first strategy, our work leverages many machine learning resources across the company, such as tools and infrastructures developed by Google Brain and Google Research. In 2017, our machine learning models successfully detected 60.3% of PHAs identified by Google Play Protect, covering over 2 billion Android devices. We continue to research and invest in machine learning to scale and simplify the detection of PHAs in the Android ecosystem.


This work was developed in joint collaboration with Google Play Protect, Safe Browsing and Play Abuse teams with contributions from Andrew Ahn, Hrishikesh Aradhye, Daniel Bali, Hongji Bao, Yajie Hu, Arthur Kaiser, Elena Kovakina, Salvador Mandujano, Melinda Miller, Rahul Mishra, Damien Octeau, Sebastian Porst, Chuangang Ren, Monirul Sharif, Sri Somanchi, Sai Deep Tetali, Zhikun Wang, and Mo Yu.
Kategorie: Hacking & Security

Introducing the Android Ecosystem Security Transparency Report

9 Listopad, 2018 - 15:44
Posted by Jason Woloz and Eugene Liderman, Android Security & Privacy Team

Update: We identified a bug that affected how we calculated data from Q3 2018 in the Transparency Report. This bug created inconsistencies between the data in the report and this blog post. The data points in this blog post have been corrected.

As shared during the What's new in Android security session at Google I/O 2018, transparency and openness are important parts of Android's ethos. We regularly blog about new features and enhancements and publish an annual Android Security Year in Review, which highlights Android ecosystem trends. To provide more frequent insights, we're introducing a quarterly Android Ecosystem Security Transparency Report. This report is the latest addition to our Transparency Report site, which began in 2010 to show how the policies and actions of governments and corporations affect privacy, security, and access to information online.

This Android Ecosystem Security Transparency Report covers how often a routine, full-device scan by Google Play Protect detects a device with PHAs installed. Google Play Protect is built-in protection on Android devices that scans over 50 billion apps daily from inside and outside of Google Play. These scans look for evidence of Potentially Harmful Applications (PHAs). If the scans find a PHA, Google Play Protect warns the user and can disable or remove PHAs. In Android's first annual Android Security Year in Review from 2014, fewer than 1% of devices had PHAs installed. The percentage has declined steadily over time and this downward trend continues through 2018. The transparency report covers PHA rates in three areas: market segment (whether a PHA came from Google Play or outside of Google Play), Android version, and country.

Devices with Potentially Harmful Applications installed by market segment

Google works hard to protect your Android device: no matter where your apps come from. Continuing the trend from previous years, Android devices that only download apps from Google Play are 9 times less likely to get a PHA than devices that download apps from other sources. Before applications become available in Google Play they undergo an application review to confirm they comply with Google Play policies. Google uses a risk scorer to analyze apps to detect potentially harmful behavior. When Google’s application risk analyzer discovers something suspicious, it flags the app and refers the PHA to a security analyst for manual review if needed. We also scan apps that users download to their device from outside of Google Play. If we find a suspicious app, we also protect users from that—even if it didn't come from Google Play.

In the Android Ecosystem Security Transparency Report, the Devices with Potentially Harmful Applications installed by market segment chart shows the percentage of Android devices that have one or more PHAs installed over time. The chart has two lines: PHA rate for devices that exclusively install from Google Play and PHA rate for devices that also install from outside of Google Play. In 2017, on average 0.09% of devices that exclusively used Google Play had one or more PHAs installed. The first three quarters in 2018 averaged a lower PHA rate of 0.08%.

The security of devices that installed apps from outside of Google Play also improved. In 2017, ~0.82% of devices that installed apps from outside of Google Play were affected by PHA; in the first three quarters of 2018, ~0.68% were affected. Since 2017, we've reduced this number by expanding the auto-disable feature which we covered on page 10 in the 2017 Year in Review. While malware rates fluctuate from quarter to quarter, our metrics continue to show a consistent downward trend over time. We'll share more details in our 2018 Android Security Year in Review in early 2019.

Devices with Potentially Harmful Applications installed by Android version

Newer versions of Android are less affected by PHAs. We attribute this to many factors, such as continued platform and API hardening, ongoing security updates and app security and developer training to reduce apps' access to sensitive data. In particular, newer Android versions—such as Nougat, Oreo, and Pie—are more resilient to privilege escalation attacks that had previously allowed PHAs to gain persistence on devices and protect themselves against removal attempts. The Devices with Potentially Harmful Applications installed by Android version chart shows the percentage of devices with a PHA installed, sorted by the Android version that the device is running.

Devices with Potentially Harmful Applications rate by top 10 countries

Overall, PHA rates in the ten largest Android markets have remained steady. While these numbers fluctuate on a quarterly basis due to the fluidity of the marketplace, we intend to provide more in depth coverage of what drove these changes in our annual Year in Review in Q1, 2019.

The Devices with Potentially Harmful Applications rate by top 10 countries chart shows the percentage of devices with at least one PHA in the ten countries with the highest volume of Android devices. India saw the most significant decline in PHAs present on devices, with the average rate of infection dropping by 34 percent. Indonesia, Mexico, and Turkey also saw a decline in the likelihood of PHAs being present on devices in the region. South Korea saw the lowest number of devices containing PHA, with only 0.1%.

Check out the report

Over time, we'll add more insights into the health of the ecosystem to the Android Ecosystem Security Transparency Report. If you have any questions about terminology or the products referred to in this report please review the FAQs section of the Transparency Report. In the meantime, check out our new blog post and video outlining Android’s performance in Gartner’s Mobile OSs and Device Security: A Comparison of Platforms report.
Kategorie: Hacking & Security

A New Chapter for OSS-Fuzz

6 Listopad, 2018 - 22:11
Posted by Matt Ruhstaller, TPM and Oliver Chang, Software Engineer, Google Security Team

Open Source Software (OSS) is extremely important to Google, and we rely on OSS in a variety of customer-facing and internal projects. We also understand the difficulty and importance of securing the open source ecosystem, and are continuously looking for ways to simplify it.

For the OSS community, we currently provide OSS-Fuzz, a free continuous fuzzing infrastructure hosted on the Google Cloud Platform. OSS-Fuzz uncovers security vulnerabilities and stability issues, and reports them directly to developers. Since launching in December 2016, OSS-Fuzz has reported over 9,000 bugs directly to open source developers.

In addition to OSS-Fuzz, Google's security team maintains several internal tools for identifying bugs in both Google internal and Open Source code. Until recently, these issues were manually reported to various public bug trackers by our security team and then monitored until they were resolved. Unresolved bugs were eligible for the Patch Rewards Program. While this reporting process had some success, it was overly complex. Now, by unifying and automating our fuzzing tools, we have been able to consolidate our processes into a single workflow, based on OSS-Fuzz. Projects integrated with OSS-Fuzz will benefit from being reviewed by both our internal and external fuzzing tools, thereby increasing code coverage and discovering bugs faster.

We are committed to helping open source projects benefit from integrating with our OSS-Fuzz fuzzing infrastructure. In the coming weeks, we will reach out via email to critical projects that we believe would be a good fit and support the community at large. Projects that integrate are eligible for rewards ranging from $1,000 (initial integration) up to $20,000 (ideal integration); more details are available here. These rewards are intended to help offset the cost and effort required to properly configure fuzzing for OSS projects. If you would like to integrate your project with OSS-Fuzz, please submit your project for review. Our goal is to admit as many OSS projects as possible and ensure that they are continuously fuzzed.

Once contacted, we might provide a sample fuzz target to you for easy integration. Many of these fuzz targets are generated with new technology that understands how library APIs are used appropriately. Watch this space for more details on how Google plans to further automate fuzz target creation, so that even more open source projects can benefit from continuous fuzzing.

Thank you for your continued contributions to the Open Source community. Let’s work together on a more secure and stable future for Open Source Software.
Kategorie: Hacking & Security

Announcing some security treats to protect you from attackers’ tricks

2 Listopad, 2018 - 18:53
Posted by Jonathan Skelker, Product Manager

It’s Halloween ???? and the last day of Cybersecurity Awareness Month ????, so we’re celebrating these occasions with security improvements across your account journey: before you sign in, as soon as you’ve entered your account, when you share information with other apps and sites, and the rare event in which your account is compromised.

We’re constantly protecting your information from attackers’ tricks, and with these new protections and tools, we hope you can spend your Halloween worrying about zombies, witches, and your candy loot—not the security of your account.

Protecting you before you even sign in
Everyone does their best to keep their username and password safe, but sometimes bad actors may still get them through phishing or other tricks. Even when this happens, we will still protect you with safeguards that kick-in before you are signed into your account.

When your username and password are entered on Google’s sign-in page, we’ll run a risk assessment and only allow the sign-in if nothing looks suspicious. We’re always working to improve this analysis, and we’ll now require that JavaScript is enabled on the Google sign-in page, without which we can’t run this assessment.

Chances are, JavaScript is already enabled in your browser; it helps power lots of the websites people use everyday. But, because it may save bandwidth or help pages load more quickly, a tiny minority of our users (0.1%) choose to keep it off. This might make sense if you are reading static content, but we recommend that you keep Javascript on while signing into your Google Account so we can better protect you. You can read more about how to enable JavaScript here.
Keeping your Google Account secure while you’re signed in
Last year, we launched a major update to the Security Checkup that upgraded it from the same checklist for everyone, to a smarter tool that automatically provides personalized guidance for improving the security of your Google Account.
We’re adding to this advice all the time. Most recently, we introduced better protection against harmful apps based on recommendations from Google Play Protect, as well as the ability to remove your account from any devices you no longer use.More notifications when you share your account data with apps and sites
It’s really important that you understand the information that has been shared with apps or sites so that we can keep you safe. We already notify you when you’ve granted access to sensitive information — like Gmail data or your Google Contacts — to third-party sites or apps, and in the next few weeks, we’ll expand this to notify you whenever you share any data from your Google Account. You can always see which apps have access to your data in the Security Checkup.
Helping you get back to the beginning if you run into trouble
In the rare event that your account is compromised, our priority is to help get you back to safety as quickly as possible. We’ve introduced a new, step-by-step process within your Google Account that we will automatically trigger if we detect potential unauthorized activity.
We'll help you:
  • Verify critical security settings to help ensure your account isn’t vulnerable to additional attacks and that someone can’t access it via other means, like a recovery phone number or email address.
  • Secure your other accounts because your Google Account might be a gateway to accounts on other services and a hijacking can leave those vulnerable as well.
  • Check financial activity to see if any payment methods connected to your account, like a credit card or Google Pay, were abused.
  • Review content and files to see if any of your Gmail or Drive data was accessed or mis-used.
Online security can sometimes feel like walking through a haunted house—scary, and you aren't quite sure what may pop up. We are constantly working to strengthen our automatic protections to stop attackers and keep you safe you from the many tricks you may encounter. During Cybersecurity Month, and beyond, we've got your back.
Kategorie: Hacking & Security

Introducing reCAPTCHA v3: the new way to stop bots

30 Říjen, 2018 - 00:53
Posted by Wei Liu, Google Product Manager

[Cross-posted from the Google Webmaster Central Blog]

Today, we’re excited to introduce reCAPTCHA v3, our newest API that helps you detect abusive traffic on your website without user interaction. Instead of showing a CAPTCHA challenge, reCAPTCHA v3 returns a score so you can choose the most appropriate action for your website.

A frictionless user experience

Over the last decade, reCAPTCHA has continuously evolved its technology. In reCAPTCHA v1, every user was asked to pass a challenge by reading distorted text and typing into a box. To improve both user experience and security, we introduced reCAPTCHA v2 and began to use many other signals to determine whether a request came from a human or bot. This enabled reCAPTCHA challenges to move from a dominant to a secondary role in detecting abuse, letting about half of users pass with a single click. Now with reCAPTCHA v3, we are fundamentally changing how sites can test for human vs. bot activities by returning a score to tell you how suspicious an interaction is and eliminating the need to interrupt users with challenges at all. reCAPTCHA v3 runs adaptive risk analysis in the background to alert you of suspicious traffic while letting your human users enjoy a frictionless experience on your site.

More Accurate Bot Detection with "Actions"

In reCAPTCHA v3, we are introducing a new concept called “Action”—a tag that you can use to define the key steps of your user journey and enable reCAPTCHA to run its risk analysis in context. Since reCAPTCHA v3 doesn't interrupt users, we recommend adding reCAPTCHA v3 to multiple pages. In this way, the reCAPTCHA adaptive risk analysis engine can identify the pattern of attackers more accurately by looking at the activities across different pages on your website. In the reCAPTCHA admin console, you can get a full overview of reCAPTCHA score distribution and a breakdown for the stats of the top 10 actions on your site, to help you identify which exact pages are being targeted by bots and how suspicious the traffic was on those pages.
Fighting bots your way
Another big benefit that you’ll get from reCAPTCHA v3 is the flexibility to prevent spam and abuse in the way that best fits your website. Previously, the reCAPTCHA system mostly decided when and what CAPTCHAs to serve to users, leaving you with limited influence over your website’s user experience. Now, reCAPTCHA v3 will provide you with a score that tells you how suspicious an interaction is. There are three potential ways you can use the score. First, you can set a threshold that determines when a user is let through or when further verification needs to be done, for example, using two-factor authentication and phone verification. Second, you can combine the score with your own signals that reCAPTCHA can’t access—such as user profiles or transaction histories. Third, you can use the reCAPTCHA score as one of the signals to train your machine learning model to fight abuse. By providing you with these new ways to customize the actions that occur for different types of traffic, this new version lets you protect your site against bots and improve your user experience based on your website’s specific needs.

In short, reCAPTCHA v3 helps to protect your sites without user friction and gives you more power to decide what to do in risky situations. As always, we are working every day to stay ahead of attackers and keep the Internet easy and safe to use (except for bots).

Ready to get started with reCAPTCHA v3? Visit our developer site for more details.
Kategorie: Hacking & Security