Source Code and Cross-Domain Authorship Attribution
The Role of Stylometry in Privacy
Stylometry is the study of linguistic style in text. It existed long before computers, but the field is now dominated by artificial intelligence techniques.
Writing style is a marker of identity: the linguistic information in a document can be used to recognize its author. Authorship recognition is a threat to anonymity, but knowing how authors are identified also provides methods for anonymizing them. Even basic stylometry systems reach high accuracy in classifying authors correctly. Stylometry can also be applied to source code to identify the author of a program. In this talk, we investigate methods to de-anonymize source code authors of C++ and authors across different domains.
Source code authorship attribution could provide proof of authorship in court, automate the process of finding a cyber criminal from the source code left in an infected system, or aid in resolving copyright, copyleft and plagiarism issues in the programming fields. Programmers can obfuscate their variable or function names, but not the structures they subconsciously prefer to use or their favorite increment operators.
Following this intuition, we create a new feature set that reflects coding style from properties derived from abstract syntax trees. We reach 99% accuracy in attributing 36 authors, each with ten files. We experiment with datasets of many different sizes and obtain high true positive rates. Such a representation of coding style has not previously been used as a machine learning feature for authorship attribution, which makes it a valuable contribution to the field.
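To make the idea of syntax-tree-derived style features concrete, here is a minimal sketch. It is an assumption-laden illustration, not the feature set from the talk: it uses Python's standard `ast` module on Python snippets as a stand-in for a C++ parser, and the helper names `ast_node_frequencies` and `max_depth` are hypothetical.

```python
import ast
from collections import Counter
from typing import Dict


def ast_node_frequencies(source: str) -> Dict[str, float]:
    """Relative frequency of each AST node type in a piece of source code.

    Illustrative only: Python's ast module stands in for a C++ parser.
    """
    tree = ast.parse(source)
    counts = Counter(type(node).__name__ for node in ast.walk(tree))
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def max_depth(node: ast.AST, depth: int = 0) -> int:
    """Depth of the deepest node below `node` (a simple structural feature)."""
    children = list(ast.iter_child_nodes(node))
    if not children:
        return depth
    return max(max_depth(child, depth + 1) for child in children)


# Two hypothetical authors looping three times in different ways.
sample_a = "i = 0\nwhile i < 3:\n    i += 1\n"
sample_b = "for i in range(3):\n    pass\n"

features_a = ast_node_frequencies(sample_a)
features_b = ast_node_frequencies(sample_b)
depth_a = max_depth(ast.parse(sample_a))
```

Renaming `i` would leave both feature vectors unchanged, while the choice between `while` and `for` shows up directly in them. Vectors like these could then be passed to any standard classifier.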
We also examine the need for cross-domain stylometry, where the documents of known authorship and the documents in question are written in different contexts. Specifically, we look at blogs, Twitter feeds, and Reddit comments. While traditional stylometry methods that work well within one domain fail to identify authors across domains, we are able to improve the accuracy of cross-domain stylometry to as high as 80%. Being able to identify authors across domains facilitates linking identities across the Internet, making this a key privacy concern; users can take other measures to ensure their anonymity, but due to their unique writing style, they may not be as anonymous as they believe.
Anonymity is a topic researched in detail at the Privacy, Security, and Automation Lab at Drexel University. We study how to effectively identify the author of text of unknown authorship and how to anonymize text of known authorship. In our previous talks at CCC, we presented methods to identify authors of regular text, translated text, and users (a.k.a. cyber-criminals) of online underground forums. We introduced our authorship anonymization framework ‘Anonymouth’. We have often been asked how de-anonymization techniques would work on source code and across different domains. In this year’s talk, we will focus on identifying the authors of source code and on cross-domain stylometry.
Can the authors of source code be identified automatically through features of their programming style? Do they leave coding “footprints”?
Holding important implications for protecting intellectual property as well as for identifying malware authors and tracking how malware spreads and evolves, this question spurred a cross-cutting research project involving NLP and machine learning.
Code stylometry requires features unique to coding and to the programming language. Source code has different properties than common writing, such as the lineage, keywords, comments, the way functions and variables are created, and the grammar of the program.
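As a rough illustration of code-specific surface features such as keywords, comments, and increment operators, the sketch below computes a few of them from C++ text with regular expressions. Everything in it is an assumption made for illustration: the function `lexical_features`, the keyword list, and the choice of features are hypothetical and are not taken from the talk. The matching is deliberately crude; for example, keywords inside comments are counted too.

```python
import re
from collections import Counter
from typing import Dict

# Hypothetical, deliberately short list of C++ control-flow keywords.
CPP_KEYWORDS = ("for", "while", "do", "if", "else", "switch")


def lexical_features(code: str) -> Dict[str, float]:
    """Crude per-file style features from C++ source text (illustrative)."""
    nonblank = [line for line in code.splitlines() if line.strip()]
    n = max(len(nonblank), 1)  # avoid division by zero on empty input
    words = Counter(re.findall(r"[A-Za-z_]\w*", code))

    # Keyword occurrences per non-blank line.
    feats = {"kw_" + kw: words[kw] / n for kw in CPP_KEYWORDS}
    # Share of non-blank lines that are pure // comments.
    feats["comment_lines"] = sum(
        1 for line in nonblank if line.strip().startswith("//")
    ) / n
    # Share of non-blank lines indented with a tab.
    feats["tab_indented"] = sum(
        1 for line in nonblank if line.startswith("\t")
    ) / n
    # Raw counts of ++x versus x++ style increments.
    feats["pre_increment"] = len(re.findall(r"\+\+\s*[A-Za-z_]", code))
    feats["post_increment"] = len(re.findall(r"[A-Za-z_\]\)]\s*\+\+", code))
    return feats


sample = "// sum\nint s = 0;\nfor (int i = 0; i < n; ++i) {\n\ts += i;\n}\n"
feats = lexical_features(sample)
```

A real system would tokenize with a proper C++ lexer rather than regular expressions, so that comments and string literals do not distort the counts.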
[...]
──────────
➤Speaker: Aylin, greenie, Rebekah Overdorf
➤EventID: 6173
➤Event: 31st Chaos Communication Congress [31c3] of the Chaos Computer Club [CCC]
➤Location: Congress Centrum Hamburg (CCH); Am Dammtor; Marseiller Straße; 20355 Hamburg; Germany
➤Language: English
➤Begin: Mon, 12/29/2014 17:15:00 +01:00
➤License: CC-by
- published: 30 Dec 2014