Command And Control Structures In Malware Bibtex Bibliography

Posted on by Nehn

For any academic/research writing, incorporating references into a document is an important task. Fortunately, LaTeX has a variety of features that make dealing with references much simpler, including built-in support for citing references. However, a much more powerful and flexible solution is achieved thanks to an auxiliary tool called BibTeX (which comes bundled as standard with LaTeX). Recently, BibTeX has been succeeded by BibLaTeX, a tool configurable within LaTeX syntax.

BibTeX provides for the storage of all references in an external, flat-file database. (BibLaTeX uses this same syntax.) This database can be referenced in any LaTeX document, and citations can be made to any record contained within the file. This is often more convenient than embedding references at the end of every document written; a centralized bibliography source can be linked to as many documents as desired (write once, read many!). Of course, bibliographies can be split over as many files as one wishes, so there can be one file containing sources concerning topic A and another concerning topic B. When writing about topic AB, both of these files can be linked into the document (perhaps in addition to sources specific to topic AB).

Embedded system

If you are writing only one or two documents and aren't planning on writing more on the same subject for a long time, you might not want to waste time creating a database of references you are never going to use. In this case you should consider using the basic and simple bibliography support that is embedded within LaTeX.

LaTeX provides an environment called thebibliography that you have to use where you want the bibliography; that usually means at the very end of your document, just before the \end{document} command. Here is a practical example:

\begin{thebibliography}{9}

\bibitem{lamport94}
  Leslie Lamport,
  \textit{\LaTeX: a document preparation system},
  Addison Wesley, Massachusetts,
  2nd edition,
  1994.

\end{thebibliography}

OK, so what is going on here? The first thing to notice is the establishment of the environment. thebibliography is a keyword that tells LaTeX to recognize everything between the begin and end tags as data for the bibliography. The mandatory argument, which I supplied after the begin statement, tells LaTeX how wide the item label will be when printed. Note, however, that the number itself is not the parameter; the number of digits is. Therefore, I am effectively telling LaTeX that I will only need reference labels of one character in length, which ultimately means no more than nine references in total. If you want more than nine, then input any two-digit number, such as '56', which allows up to 99 references.

Next is the actual reference entry itself. This is prefixed with the \bibitem{cite_key} command. The cite_key should be a unique identifier for that particular reference, and is often some sort of mnemonic consisting of any sequence of letters, numbers and punctuation symbols (although not a comma). I often use the surname of the first author, followed by the last two digits of the year (hence lamport94). If that author has produced more than one reference for a given year, then I add letters after it: 'a', 'b', etc. But you should do whatever works for you. Everything after the key is the reference itself. You need to type it as you want it to be presented. I have put the different parts of the reference, such as author, title, etc., on different lines for readability. These linebreaks are ignored by LaTeX. The \textit command formats the title properly in italics.


To actually cite a given document is very easy. Go to the point where you want the citation to appear, and use \cite{cite_key}, where the cite_key is that of the bibitem you wish to cite. When LaTeX processes the document, the citation will be cross-referenced with the bibitems and replaced with the appropriate number citation. The advantage here, once again, is that LaTeX looks after the numbering for you. If it were totally manual, then adding or removing a reference would be a real chore, as you would have to re-number all the citations by hand.

Instead of WYSIWYG editors, typesetting systems like \TeX{} or \LaTeX{}\cite{lamport94} can be used.
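Putting the pieces above together, a complete document using the embedded system might look like the following minimal sketch. The article class and the surrounding sentence are only assumptions chosen for illustration; the bibitem is the lamport94 entry from above. Run LaTeX twice so that the citation number is resolved.

```latex
\documentclass{article}

\begin{document}

Instead of WYSIWYG editors, typesetting systems like \TeX{} or
\LaTeX{}~\cite{lamport94} can be used.

% The argument {9} reserves room for one-digit labels.
\begin{thebibliography}{9}

\bibitem{lamport94}
  Leslie Lamport,
  \textit{\LaTeX: a document preparation system},
  Addison Wesley, Massachusetts,
  2nd edition,
  1994.

\end{thebibliography}

\end{document}
```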

Referring more specifically

If you want to refer to a certain page, figure or theorem in a text book, you can use the optional argument to the \cite command:

\cite[chapter, p.~215]{citation01}

The argument, "chapter, p. 215", will show up inside the same brackets as the citation. Note the tilde in [p.~215], which replaces the end-of-sentence spacing with a non-breakable inter-word space. This non-breakable inter-word space is inserted because the end-of-sentence spacing would be too wide, and "p." should not be separated from the page number.

Multiple citations

When a sequence of multiple citations is needed, you should use a single \cite command, with the citation keys separated by commas.
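As a small illustration, a multiple citation could be written as follows. The keys citation01, citation02 and citation03 are placeholders and must match \bibitem keys (or .bib entries) that actually exist in your document.

```latex
% One \cite command, several keys, no spaces needed after the commas
\cite{citation01,citation02,citation03}
```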


The result will then be shown as citations inside the same brackets, depending on the citation style.

Bibliography styles

There are several different ways to format lists of bibliographic references and the citations to them in the text. These are called citation styles, and consist of two parts: the format of the abbreviated citation (i.e. the marker that is inserted into the text to identify the entry in the list of references) and the format of the corresponding entry in the list of references, which includes full bibliographic details.

Abbreviated citations can be of two main types: numbered or textual. Numbered citations (also known as the Vancouver referencing system) are numbered consecutively in order of appearance in the text, and consist of Arabic numerals in parentheses (1), in square brackets [1], in superscript, or a combination thereof. Textual citations (also known as the Harvard referencing system) use the author surname and (usually) the year as the abbreviated form of the citation, which is normally fully enclosed in parentheses, as in (Smith 2008), or partially enclosed, as in Smith (2008). The latter form allows the citation to be integrated in the sentence it supports.

Some of the more commonly used styles are:

| Style Name | Author Name Format | Reference Format | Sorting |
|---|---|---|---|
| plain | Homer Jay Simpson | #ID# | by author |
| unsrt | Homer Jay Simpson | #ID# | as referenced |
| abbrv | H. J. Simpson | #ID# | by author |
| alpha | Homer Jay Simpson | Sim95 | by author |
| abstract | Homer Jay Simpson | Simpson-1995a | |
| acm | Simpson, H. J. | #ID# | |
| authordate1 | Simpson, Homer Jay | Simpson, 1995 | |
| apacite | Simpson, H. J. (1995) | Simpson1995 | |
| named | Homer Jay Simpson | Simpson 1995 | |

However, keep in mind that you will need to use the natbib package to use most of these.

No cite

If you only want a reference to appear in the bibliography, but not where it is referenced in the main text, then the \nocite command can be used, for example:

Lamport showed in 1995 something... \nocite{lamport95}.

A special version of the command, \nocite{*}, includes all entries from the database, whether they are referenced in the document or not.


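| Citation command | Output |
|---|---|
| \citet{goossens93} | Goossens et al. (1993) |
| \citep{goossens93} | (Goossens et al., 1993) |
| \citet*{goossens93} | Goossens, Mittelbach, and Samarin (1993) |
| \citep*{goossens93} | (Goossens, Mittelbach, and Samarin, 1993) |
| \citeauthor{goossens93} | Goossens et al. |
| \citeauthor*{goossens93} | Goossens, Mittelbach, and Samarin |
| \citealt{goossens93} | Goossens et al. 1993 |
| \citealp{goossens93} | Goossens et al., 1993 |
| \citetext{priv.\ comm.} | (priv. comm.) |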

Using the standard LaTeX bibliography support, you will see that each reference is numbered and each citation corresponds to the numbers. The numeric style of citation is quite common in scientific writing. In other disciplines, an author-year style such as Harvard, e.g., (Roberts, 2003), is preferred. A discussion about which is best will not occur here, but a possible way to get such output is the natbib package. In fact, it can supersede LaTeX's own citation commands, as natbib allows the user to switch easily between Harvard and numeric styles.

The first job is to load the natbib package in your preamble with \usepackage{natbib}.

Also, you need to change the bibliography style file to be used, so edit the appropriate line at the bottom of the file so that it reads \bibliographystyle{plainnat}. Once done, it is basically a matter of altering the existing \cite commands to display the type of citation you want.
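The two changes described above can be seen together in the following minimal sketch. It assumes a database file named sample.bib that contains the goossens93 and greenwade93 entries used elsewhere in this tutorial; the sentence itself is only filler.

```latex
\documentclass{article}
\usepackage{natbib}            % author-year citations by default

\begin{document}

% \citet gives a textual citation, \citep a parenthetical one
\citet{goossens93} describe many useful packages, and the archive
that hosts them is documented elsewhere \citep{greenwade93}.

\bibliographystyle{plainnat}   % natbib-compatible version of plain
\bibliography{sample}          % assumes sample.bib holds both entries

\end{document}
```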

| Style | Source | Description |
|---|---|---|
| plainnat | Provided | natbib-compatible version of plain |
| abbrvnat | Provided | natbib-compatible version of abbrv |
| unsrtnat | Provided | natbib-compatible version of unsrt |
| apsrev | REVTeX 4 home page | natbib-compatible style for Physical Review journals |
| rmpaps | REVTeX 4 home page | natbib-compatible style for Review of Modern Physics journals |
| IEEEtranN | TeX Catalogue entry | natbib-compatible style for IEEE publications |
| achemso | TeX Catalogue entry | natbib-compatible style for American Chemical Society journals |
| rsc | TeX Catalogue entry | natbib-compatible style for Royal Society of Chemistry journals |


| Option | Meaning |
|---|---|
| round : square : curly : angle | Parentheses () (default), square brackets [], curly braces {} or angle brackets <> |
| colon : comma | multiple citations are separated by semi-colons (default) or commas |
| authoryear : numbers : super | author-year style citations (default), numeric citations or superscripted numeric citations |
| sort : sort&compress | multiple citations are sorted into the order in which they appear in the references section, or also compressed into ranges where possible for numeric citations |
| longnamesfirst | the first citation of any reference will use the starred variant (full author list), subsequent citations will use the abbreviated et al. style |
| sectionbib | for use with the chapterbib package; redefines \thebibliography to issue \section* instead of \chapter* |
| nonamebreak | keeps all the authors' names in a citation on one line to fix some hyperref problems (causes overfull hboxes) |

The main commands simply add a t for 'textual' or p for 'parenthesized' to the basic \cite command. You will also notice how natbib by default compresses references with three or more authors to the more concise "first surname et al." version. By adding an asterisk (*), you can override this default and list all authors associated with that citation. Natbib supports some other specialized commands, listed in the table above. Keep in mind that some bibliography styles do not support the starred variants and will automatically choose between all authors and et al.

The final area that I wish to cover about natbib is customizing its citation style. There is a command called \bibpunct that can be used in the preamble to override the defaults and change certain settings.


The command requires six mandatory parameters.

  1. The symbol for the opening bracket.
  2. The symbol for the closing bracket.
  3. The symbol that appears between multiple citations.
  4. This argument takes a letter:
    • n - numerical style.
    • s - numerical superscript style.
    • any other letter - author-year style.
  5. The punctuation to appear between the author and the year (in parenthetical case only).
  6. The punctuation used between years in multiple citations when there is a common author, e.g., (Chomsky 1956, 1957). If you want an extra space, then you need {,~}.
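As an illustration of the six parameters listed above, one possible setting is sketched below. The particular values are only an example: round brackets, a semi-colon between multiple citations, author-year style (any letter other than n or s, here a), and a comma both between author and year and between years.

```latex
% In the preamble, after \usepackage{natbib}
%         1  2  3  4  5  6
\bibpunct{(}{)}{;}{a}{,}{,}
```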

Some of the options controlled by \bibpunct are also accessible by passing options to the natbib package when it is loaded. These options also allow some other aspects of the bibliography to be controlled, and can be seen in the options table above.

So as you can see, this package is quite flexible, especially as you can easily switch between different citation styles by changing a single parameter. Do have a look at the natbib manual; it's a short document, and you can learn even more about how to use the package.


I have previously introduced the idea of embedding references at the end of the document, and then using the \cite command to cite them within the text. In this tutorial, I want to do a little better than this method, as it's not as flexible as it could be. I will concentrate on using BibTeX.

A BibTeX database is stored as a .bib file. It is a plain text file, and so can be viewed and edited easily. The structure of the file is also quite simple. An example of a BibTeX entry:

@article{greenwade93,
    author  = "George D. Greenwade",
    title   = "The {C}omprehensive {T}ex {A}rchive {N}etwork ({CTAN})",
    year    = "1993",
    journal = "TUGBoat",
    volume  = "14",
    number  = "3",
    pages   = "342--351"
}

Each entry begins with the declaration of the reference type, in the form of @type. BibTeX knows of practically all types you can think of; common ones are book, article, and, for papers presented at conferences, inproceedings. In this example, I have referred to an article within a journal.

After the type, you must have a left curly brace '{' to signify the beginning of the reference attributes. The first one follows immediately after the brace, which is the citation key, or the BibTeX key. This key must be unique for all entries in your bibliography. It is this identifier that you will use within your document to cross-reference it to this entry. It is up to you as to how you wish to label each reference, but there is a loose standard in which you use the author's surname, followed by the year of publication. This is the scheme that I use in this tutorial.

Next, it should be clear that what follows are the relevant fields and data for that particular reference. The field names on the left are BibTeX keywords. They are followed by an equals sign (=), after which the value for that field is placed. BibTeX expects you to explicitly label the beginning and end of each value. I personally use quotation marks ("); however, you also have the option of using curly braces ('{', '}'). But as you will soon see, curly braces have other roles within attributes, so I prefer not to use them for this job, as they can get more confusing. A notable exception is when you want to use characters with umlauts (ü, ö, etc.), since their notation is in the format \"{o}, and the quotation mark will close the one opening the field, causing an error in the parsing of the reference. Using \usepackage[utf8]{inputenc} in the preamble to the source file can get round this, as the accented characters can just be stored in the .bib file without any need for special markup. This allows a consistent format to be kept throughout the file, avoiding the need to use braces when there are umlauts to consider.
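To make the umlaut issue concrete, here is a small sketch of an entry whose values are delimited with curly braces, so that the quotation mark in the accent command cannot be mistaken for the end of the field. The entry is entirely hypothetical (key, author, title and publisher are invented for illustration).

```bibtex
@book{mueller08,
    author    = {J{\"u}rgen M{\"u}ller},
    title     = {A Hypothetical Book},
    publisher = {Example Press},
    year      = {2008}
}
```

BibTeX also ignores a quotation mark that sits inside braces, which is why wrapping the whole accented letter, as in {\"u}, is the safer habit even in quote-delimited fields.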

Remember that each attribute must be followed by a comma to delimit one from another. You do not need to add a comma to the last attribute, since the closing brace will tell BibTeX that there are no more attributes for this entry, although you won't get an error if you do.

It can take a while to learn what the reference types are, and what fields each type has available (and which ones are required or optional, etc.). It may be worth keeping a reference of entry types and fields at hand for when you need it. The standard entry types are article, book, booklet, inbook, incollection, inproceedings (equivalent to conference), manual, mastersthesis, phdthesis, misc, proceedings, techreport and unpublished. The required and optional fields of each type are listed in the Standard templates section below.


BibTeX can be quite clever with names of authors. It can accept names in forename surname or surname, forename. I personally use the former, but remember that the order you input them (or any data within an entry for that matter) is customizable and so you can get BibTeX to manipulate the input and then output it however you like. If you use the forename surname method, then you must be careful with a few special names, where there are compound surnames, for example "John von Neumann". In this form, BibTeX assumes that the last word is the surname, and everything before is the forename, plus any middle names. You must therefore manually tell BibTeX to keep the 'von' and 'Neumann' together. This is achieved easily using curly braces. So the final result would be "John {von Neumann}". This is easily avoided with the surname, forename, since you have a comma to separate the surname from the forename.
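The two ways of writing a compound surname described above can be sketched as follows. Both entries are hypothetical and exist only to show the author field.

```bibtex
@misc{neumann-braces,
    author = "John {von Neumann}",
    title  = "A Hypothetical Entry"
}

@misc{neumann-comma,
    author = "von Neumann, John",
    title  = "Another Hypothetical Entry"
}
```

In the first form the braces keep 'von Neumann' together as a single unit; in the second, the comma marks where the surname ends.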

Secondly, there is the issue of how to tell BibTeX when a reference has more than one author. This is very simply done by putting the keyword and in between every author. As we can see from another example:

@book{goossens93,
    author    = "Michel Goossens and Frank Mittelbach and Alexander Samarin",
    title     = "The LaTeX Companion",
    year      = "1993",
    publisher = "Addison-Wesley",
    address   = "Reading, Massachusetts"
}

This book has three authors, and each is separated as described. Of course, when BibTeX processes and outputs this, there will only be an 'and' between the penultimate and last authors, but within the .bib file, it needs the ands so that it can keep track of the individual authors.

Standard templates

Be careful if you copy the following templates: the % sign is not valid for commenting out lines inside BibTeX entries. If you want to comment out a line, you have to put it outside the entry.

@article: An article from a magazine or a journal.
  • Required fields: author, title, journal, year.
  • Optional fields: volume, number, pages, month, note.
@book: A published book.
  • Required fields: author/editor, title, publisher, year.
  • Optional fields: volume/number, series, address, edition, month, note.
@booklet: A bound work without a named publisher or sponsor.
  • Required fields: title.
  • Optional fields: author, howpublished, address, month, year, note.
@conference: Equal to inproceedings.
  • Required fields: author, title, booktitle, year.
  • Optional fields: editor, volume/number, series, pages, address, month, organization, publisher, note.
@inbook: A section of a book without its own title.
  • Required fields: author/editor, title, chapter and/or pages, publisher, year.
  • Optional fields: volume/number, series, type, address, edition, month, note.
@incollection: A section of a book having its own title.
  • Required fields: author, title, booktitle, publisher, year.
  • Optional fields: editor, volume/number, series, type, chapter, pages, address, edition, month, note.
@inproceedings: An article in a conference proceedings.
  • Required fields: author, title, booktitle, year.
  • Optional fields: editor, volume/number, series, pages, address, month, organization, publisher, note.
@manual: Technical manual.
  • Required fields: title.
  • Optional fields: author, organization, address, edition, month, year, note.
@mastersthesis: Master's thesis.
  • Required fields: author, title, school, year.
  • Optional fields: type (e.g. "diploma thesis"), address, month, note.
@misc: Template useful for other kinds of publication.
  • Required fields: none.
  • Optional fields: author, title, howpublished, month, year, note.
@phdthesis: Ph.D. thesis.
  • Required fields: author, title, year, school.
  • Optional fields: address, month, keywords, note.
@proceedings: The proceedings of a conference.
  • Required fields: title, year.
  • Optional fields: editor, volume/number, series, address, month, organization, publisher, note.
@techreport: Technical report from an educational, commercial or standardization institution.
  • Required fields: author, title, institution, year.
  • Optional fields: type, number, address, month, note.
@unpublished: An unpublished article, book, thesis, etc.
  • Required fields: author, title, note.
  • Optional fields: month, year.
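As an example of how such a listing translates into a fill-in template, here is a sketch for the article type, built from the required and optional fields given above. The key citekey is a placeholder, and the layout is only one possible convention; delete the optional fields you do not need.

```bibtex
@article{citekey,
    author  = "",
    title   = "",
    journal = "",
    year    = "",
    volume  = "",
    number  = "",
    pages   = "",
    month   = "",
    note    = ""
}
```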

Non-standard templates

BibTeX entries can be exported from Google Patents.
(see Cite Patents with Bibtex for an alternative)
For citing papers in a REVTEX-style article
(see REVTEX Author's guide)

Preserving case of letters

In the event that BibTeX has been set by the chosen style not to preserve all capitalization within titles, problems can occur, especially if you are referring to proper nouns, or acronyms. To tell BibTeX to keep them, use the good old curly braces around the letter in question, (or letters, if it's an acronym) and all will be well! It is even possible that lower-case letters may need to be preserved - for example if a chemical formula is used in a style that sets a title in all caps or small caps, or if "pH" is to be used in a style that capitalises all first letters.

However, avoid putting the whole title in curly braces, as it will look odd if a different capitalization format is used.

For convenience, though, many people simply put the whole title in double curly braces. This may help when writing scientific articles for different magazines or conferences whose BibTeX styles sometimes keep the capital letters and sometimes do not.
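The three approaches discussed above can be compared side by side. The fragments below show only the title field of a single entry, using The LaTeX Companion from the earlier example; only one of them would appear in a real entry.

```bibtex
title = "The {LaTeX} Companion",    protects only the word that needs it
title = "{The LaTeX Companion}",    protects the whole title (best avoided)
title = {{The LaTeX Companion}},    double braces, same effect as above
```

The trailing remarks on each line are explanatory text for this tutorial, not BibTeX syntax, and must not be copied into a .bib file.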

As an alternative, try other BibTeX styles or modify the existing one. The approach of putting only relevant text in curly braces is the most feasible if using a template under the control of a publisher, such as for journal submissions. Using curly braces around single letters is also to be avoided if possible, as it may mess up the kerning, especially with biblatex, so the first step should generally be to enclose single words in braces.

A few additional examples

Below you will find a few additional examples of bibliography entries. The first one covers the case of multiple authors in the Surname, Firstname format, and the second one deals with the incollection case.

@article{AbedonHymanThomas2003,
    author  = "Abedon, S. T. and Hyman, P. and Thomas, C.",
    year    = "2003",
    title   = "Experimental examination of bacteriophage latent-period evolution as a response to bacterial availability",
    journal = "Applied and Environmental Microbiology",
    volume  = "69",
    pages   = "7499--7506"
}

@incollection{Abedon1994,
    author    = "Abedon, S. T.",
    title     = "Lysis and the interaction between free phages and infected cells",
    pages     = "397--405",
    booktitle = "Molecular biology of bacteriophage T4",
    editor    = "Karam, Jim D. and Drake, John W. and Kreuzer, Kenneth N. and Mosig, Gisela and Hall, Dwight and Eiserling, Frederick A. and Black, Lindsay W. and Kutter, Elizabeth and Carlson, Karin and Miller, Eric S. and Spicer, Eleanor",
    publisher = "ASM Press, Washington DC",
    year      = "1994"
}

If you have to cite a website you can use @misc, for example:

@misc{website:fermentas-lambda,
    author = "Fermentas Inc.",
    title  = "Phage Lambda: description \& restriction map",
    month  = "November",
    year   = "2008",
    url    = ""
}

The note field comes in handy if you need to add unstructured information, for example that the corresponding issue of the journal has yet to appear:

@article{blackholes,
    author    = "Rabbert Klein",
    title     = "Black Holes and Their Relation to Hiding Eggs",
    journal   = "Theoretical Easter Physics",
    publisher = "Eggs Ltd.",
    year      = "2010",
    note      = "(to appear)"
}

Getting current LaTeX document to use your .bib file

At the end of your LaTeX file (that is, after the content, but before \end{document}), you need to place the following commands:

\bibliographystyle{plain}
\bibliography{sample1,sample2,...,samplen} % Note the lack of whitespace between the commas and the next bib file.

Bibliography styles are files recognized by BibTeX that tell it how to format the information stored in the .bib file when processed for output. And so the first command listed above is declaring which style file to use. The style file in this instance is plain.bst (which comes as standard with BibTeX). You do not need to add the .bst extension when using this command, as it is assumed. Despite its name, the plain style does a pretty good job.

The second command is the one that actually specifies the .bib file you wish to use. The ones I created for this tutorial were called sample1.bib, sample2.bib, ..., samplen.bib, but once again, you don't include the file extension. At the moment, the .bib file is in the same directory as the LaTeX document too. However, if your .bib file were elsewhere (which makes sense if you intend to maintain a centralized database of references for all your research), you would need to specify the path as well, either absolute or relative (for example, starting with ../ if the file is in the parent directory of the document that calls it).

Now that LaTeX and BibTeX know where to look for the appropriate files, actually citing the references is fairly trivial. \cite{ref_key} is the command you need, making sure that the ref_key corresponds exactly to one of the entries in the .bib file. If you wish to cite more than one reference at the same time, separate the keys with commas: \cite{ref_key1,ref_key2,...,ref_keyN}.
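Here is a minimal sketch of a complete document that uses an external database in this way. It assumes a file sample.bib in the same directory containing the greenwade93 and goossens93 entries shown earlier; the sentence in the body is only filler.

```latex
\documentclass{article}

\begin{document}

The CTAN archive is described by Greenwade~\cite{greenwade93}.
Several references can be cited at once~\cite{greenwade93,goossens93}.

\bibliographystyle{plain}   % plain.bst, no extension needed
\bibliography{sample}       % sample.bib, no extension needed

\end{document}
```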

Why won't LaTeX generate any output?

The addition of BibTeX adds extra complexity for the processing of the source to the desired output. This is largely hidden from the user, but because of all the complexity of the referencing of citations from your source LaTeX file to the database entries in another file, you actually need multiple passes to accomplish the task. This means you have to run LaTeX a number of times. Each pass will perform a particular task until it has managed to resolve all the citation references. Here's what you need to type (into command line):

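    1. latex latex_source_code.tex
    2. bibtex latex_source_code.aux
    3. latex latex_source_code.tex
    4. latex latex_source_code.tex

    (Extensions are optional; if you include them, note that the bibtex command takes the AUX file as input.)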

    After the first LaTeX run, you will see warnings such as:

    LaTeX Warning: Citation `lamport94' on page 1 undefined on input line 21.
    ...
    LaTeX Warning: There were undefined references.

    The next step is to run bibtex on that same LaTeX source (or more precisely the corresponding AUX file, however not on the actual .bib file) to then define all the references within that document. You should see output like the following:

    This is BibTeX, Version 0.99c (Web2C 7.3.1)
    The top-level auxiliary file: latex_source_code.aux
    The style file: plain.bst
    Database file #1: sample.bib

    The third step, which is invoking LaTeX for the second time, will produce more warnings, such as "LaTeX Warning: Label(s) may have changed. Rerun to get cross-references right.". Don't be alarmed; it's almost complete. As you can guess, all you have to do is follow its instructions and run LaTeX for the third time, and the document will be output as expected, without further problems.

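    If you want PDF output instead of DVI output, you can use pdflatex instead of latex as follows:

    1. pdflatex latex_source_code.tex
    2. bibtex latex_source_code.aux
    3. pdflatex latex_source_code.tex
    4. pdflatex latex_source_code.tex

    (Extensions are optional; if you include them, note that the bibtex command takes the AUX file as input.)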

      Note that if you are editing your source in vim and use the current file shortcut (%) in command mode to process the document, for example with :! bibtex %, bibtex will fail because it cannot open the corresponding AUX file. The reason is that % expands to the current file name including its extension. To process your document from within vim, you must name the file without the extension for bibtex to work. Vim's %:r modifier expands to the current file name with the extension removed, so :! bibtex %:r works (without the file extension, bibtex looks for the AUX file as mentioned above).

          However, it is much easier to install the Vim-LaTeX plugin. This allows you to simply type \ll when not in insert mode, and all the appropriate commands are automatically executed to compile the document. Vim-LaTeX even detects how many times it has to run pdflatex, and whether or not it has to run bibtex. This is just one of the many nice features of Vim-LaTeX; its Beginner's Tutorial describes the many other shortcuts it provides.

          Another option exists if you are running Unix/Linux or any other platform where you have make. Then you can simply create a Makefile and use vim's make command or use make in shell. The Makefile would then look like this:

latex_source_code.pdf: latex_source_code.tex latex_source_code.bib
	pdflatex latex_source_code.tex
	bibtex latex_source_code.aux
	pdflatex latex_source_code.tex
	pdflatex latex_source_code.tex

          Including URLs in bibliography

          As you can see, there is no field for URLs among the standard fields. One possibility is to include Internet addresses in the howpublished field of @misc or the note field of @techreport, @article or @book.
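A sketch of the first possibility is shown below. The entry is hypothetical (the key, author, title and address are placeholders), and it assumes that a package providing the \url command, such as url, is loaded in the LaTeX document.

```bibtex
@misc{website:example,
    author       = "Example Author",
    title        = "An Example Web Page",
    howpublished = "\url{http://www.example.com}",
    year         = "2008"
}
```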

          Wrap the address in the \url command to ensure the proper appearance of URLs.

          Another way is to use the special url field and make the bibliography style recognise it.

          You need to use \usepackage{url} in the first case or \usepackage{hyperref} in the second case.

          Styles provided by natbib (see above) handle this field; other styles can be modified using the urlbst program. Modifications of three standard styles (plain, abbrv and alpha) are provided with urlbst.

          If you need more help with URLs in bibliographies, see the UK List of TeX FAQ.

          Customizing bibliography appearance

          One of the main advantages of BibTeX, especially for people who write many research papers, is the ability to customize your bibliography to suit the requirements of a given publication. You will notice how different publications tend to have their own style of formatting references, to which authors must adhere if they want their manuscripts published. In fact, established journals and conference organizers often will have created their own bibliography style (.bst file) for those users of BibTeX, to do all the hard work for you.

          It can achieve this because of the nature of the .bib database, where all the information about your references is stored in a structured format, but nothing about style. This is a common theme in LaTeX in general, where it tries as much as possible to keep content and presentation separate.

          A bibliography style file (.bst) will tell LaTeX how to format each attribute, what order to put them in, what punctuation to use in between particular attributes, etc. Unfortunately, creating such a style by hand is not a trivial task, which is why makebst (also known as custom-bib) is the tool we need.

          makebst can be used to automatically generate a .bst file based on your needs. It is very simple, and actually asks you a series of questions about your preferences. Once complete, it will then output the appropriate style file for you to use.

          It should be installed with the LaTeX distribution (otherwise, you can download it) and it's very simple to initiate. At the command line, type:

          latex makebst

          LaTeX will find the relevant file and the questioning process will begin. You will have to answer quite a few questions (although the default answers are pretty sensible), which means it would be impractical to go through an example in this tutorial. However, it is fairly straightforward, and if you require further guidance, there is a comprehensive manual available. I'd recommend experimenting with it and seeing what the results are when applied to a LaTeX document.

          If you are using a custom-built .bst file, it is important that LaTeX can find it! So, make sure it's in the same directory as the LaTeX source file, unless you are using one of the standard style files (such as plain or plainnat) that come bundled with LaTeX; these will be found automatically in the directories where they are installed. Also, make sure the name of the file you want to use is reflected in the \bibliographystyle command (but don't include the extension!).

          Localizing bibliography appearance

          When writing documents in languages other than English, you may find it desirable to adapt the appearance of your bibliography to the document language. This concerns words such as editors, and, or in, as well as a proper typographic layout. The babelbib package can be used here, for example to lay out the whole bibliography in German.
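A minimal preamble sketch for the German case might look like this. It assumes that babel is loaded with German support alongside babelbib; the fixlanguage option and \selectbiblanguage are provided by babelbib.

```latex
\usepackage[german]{babel}
\usepackage[fixlanguage]{babelbib}  % one fixed language for all entries
\selectbiblanguage{german}          % lay out the bibliography in German
```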


          Alternatively, you can lay out each bibliography entry according to the language of the cited document by loading babelbib without the fixlanguage option.

          The language of an entry is then specified as an additional language field in the BibTeX entry.

          For babelbib to take effect, a bibliography style supported by it (one of babplain, babplai3, babalpha, babunsrt, bababbrv, and bababbr3) must be used.
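The per-entry approach can be sketched as follows. The entry is hypothetical (key, author, title and journal are invented for illustration); the language field and the babplain style are the parts that matter.

```bibtex
@article{mueller08,
    author   = {J{\"u}rgen M{\"u}ller},
    title    = {Ein hypothetischer Artikel},
    journal  = {Beispielzeitschrift},
    year     = {2008},
    language = {german}
}
```

```latex
\usepackage{babelbib}          % in the preamble, without fixlanguage
% ...
\bibliographystyle{babplain}   % a style supported by babelbib
\bibliography{sample}          % assumes the entry above is in sample.bib
```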


          Showing unused items

          Usually LaTeX only displays the entries which are referred to with \cite. It's possible to make uncited entries visible:

          \nocite{Name89} % Show Bibliography entry of Name89
          \nocite{*} % Show all Bib-entries

          Getting bibliographic data

          Many online databases provide bibliographic data in BibTeX-Format, making it easy to build your own database. For example, Google Scholar offers the option to return properly formatted output, which can also be turned on in the settings page.

          Be aware that bibliographic databases are frequently the product of several generations of automatic processing, so the resulting BibTeX code is prone to a variety of minor errors, especially in older entries.

          Helpful tools

          See also: Comparison of reference management software (English Wikipedia)
          • BibDesk: a bibliographic reference manager for Mac OS X. It features a very usable user interface and provides a number of features like smart folders based on keywords and live TeX display.
          • BibSonomy: a free social bookmark and publication management system based on BibTeX.
          • BibTeXSearch: a free searchable BibTeX database spanning millions of academic records.
          • Bibtex Editor: an online BibTeX entry generator and bibliography management system. It can import and export BibTeX files.
          • Bibwiki: a special page for MediaWiki to manage BibTeX bibliographies. It offers a straightforward way to import and export bibliographic records.
          • cb2Bib: a tool for rapidly extracting unformatted or unstandardized bibliographic references from email alerts, journal web pages, and PDF files.
          • Citavi: commercial software (with a size-limited free demo version) that searches libraries for citations and keeps all your knowledge in a database. The database can be exported to many formats. It works together with MS Word and OpenOffice Writer. Plug-ins for browsers and Acrobat Reader automatically include references in your project.
          • CiteULike: a free online service to organise academic papers. It can export citations in BibTeX format, and can "scrape" BibTeX data from many popular websites.

          1School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan 611731, China
          2Center for Cyber Security, University of Electronic Science and Technology of China, Chengdu, Sichuan 611731, China
          3School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan 610054, China

          Copyright © 2017 Weina Niu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

          Advanced Persistent Threat (APT) is a serious threat against sensitive information. Current detection approaches are time-consuming since they detect APT attacks by in-depth analysis of massive amounts of data after data breaches. Specifically, APT attackers make use of DNS to locate their command and control (C&C) servers and victims’ machines. In this paper, we propose an efficient approach to detect APT malware C&C domains with high accuracy by analyzing DNS logs. We first extract 15 features from DNS logs of mobile devices. According to the Alexa ranking and VirusTotal’s judgement result, we give each domain a score. Then, we select the most normal domains by the score metric. Finally, we utilize our anomaly detection algorithm, called Global Abnormal Forest (GAF), to identify malware C&C domains. We conduct a performance analysis to demonstrate that our approach is more efficient than other existing works in terms of calculation efficiency and recognition accuracy. Compared with Local Outlier Factor (LOF), k-Nearest Neighbor (KNN), and Isolation Forest (iForest), our approach scores above 99% on the reported detection metrics for C&C domains. Our approach not only reduces the volume of data that needs to be recorded and analyzed but is also applicable to unsupervised learning.

          1. Introduction

          Advanced Persistent Threat (APT) [1, 2] is an attack launched by a well-funded and skilled organization to steal high-value information over a long period. After infiltrating the targeted network, APT attackers install malware on the compromised machine to build a command and control (C&C) channel. Most malware makes use of the Domain Name System (DNS) to locate its domain name servers and compromised devices. APT attackers can then establish a long-term connection to victims’ devices to steal sensitive data. Thus, detecting malware C&C domains can help security analysts block an essential stage of an APT attack.

          Several existing works identify C&C domains by analyzing PC network traffic [3–8]. BotSniffer [3], BotGAD [4], and BotMiner [5] use specific behavioral anomalies (e.g., daily similarity and short life) to detect C&C domains involved in a botnet, relying on the fact that bot hosts exhibit group similarity. Other works [6–8] distinguish malicious domains from normal domains according to domain-based features, such as domain name string composition, registration time, and active time. However, these detection approaches cannot be applied to APT malware, since APT attackers infect a small number of machines, and they behave normally to avoid detection. Machine learning has proved effective in identifying malware [6]. However, there is little manually labeled data on APT malware. Moreover, normal and abnormal samples overlap with each other.

          In order to address these challenges, we propose an approach to identifying APT malware domains based on DNS logs. We conduct experiments to compare our proposed algorithm, called Global Abnormal Forest (GAF), with three traditional algorithms, namely, Local Outlier Factor (LOF), k-Nearest Neighbor combined with LOF (LOF-KNN), and Isolation Forest (iForest). The experimental results demonstrate that our proposed algorithm performs best on a dataset consisting of 300,000 DNS requests each day from a regional base station. Specifically, the contributions of this work are as follows:
          (i) We characterize statistics of normal domains and define a rule based on Alexa and VirusTotal to select the most normal domains.
          (ii) We extract 15 features of mobile DNS requests in multigranularity by studying large DNS logs in a real dynamic network environment consisting of 10K devices with more than 300,000 DNS requests per day.
          (iii) We propose an anomaly detection algorithm that balances accuracy and efficiency of C&C domain detection by introducing differentiated information entropy.

          The structure of this paper is arranged as follows. Section 2 motivates the need for APT malware C&C detection using anomaly detection; Section 3 presents an overview of the proposed approach, introduces the rules for identifying the most normal domains, and motivates the choice of features related to APT malware C&C domains; Section 4 describes the building of our anomaly detection model; Section 5 defines the experimental evaluation metrics and presents the results of the different algorithms; Section 6 introduces related work; Section 7 concludes the paper.

          2. Background on C&C Detection Using Anomaly Detection

          The term APT was first used in 2006 and has become widely known since the exposure of Google Aurora in 2010 [7]. In 2013, PRISM pushed APT attacks further into the spotlight. APT attacks have brought new challenges to cybersecurity because of their long latency, intelligent penetration, and heavy customization [8, 9]. APT attackers often install DNS-based APT malware, for instance, a Trojan horse or backdoor, on the infected machine to steal sensitive data and hide the real attack source. Identifying malware while it establishes its command and control channel is therefore a good choice. However, the DNS behavioral features of machines infected by APT malware differ from those of a botnet. Thus, APT malware identification based on DNS data is a challenge.

          Suspicious instances of APT malware are rare, and the volume of data is too large for experts to label fully. The most normal domain instances within the DNS data, however, are available. Moreover, anomaly detection [10] can identify new and unknown attacks since it does not depend on fixed signatures. Thus, we use anomaly detection to identify malware C&C domains using mobile DNS logs. The most common anomaly detection approaches include statistical anomaly detection, classification-based anomaly detection, and clustering-based anomaly detection [11]. If a labeled set has been collected, classification-based anomaly detection, such as Genetic Algorithm [12], Support Vector Machine [13], and Neural Network [14], is preferable. However, in a real APT attack, labels are very difficult to obtain. Unsupervised methods, such as LOF, LOF-KNN, and iForest, can be used to identify malware C&C domains. LOF [15] determines whether a data point is an outlier according to neighbor density. LOF-KNN [16] identifies outliers according to similarity. However, these two approaches have high computational complexity and produce too many false alarms. To ease these two problems, iForest [17] detects anomalies using the average path length of trees, which requires only a small subsampling size to achieve high detection performance. Thus, we can build partial models and exploit subsampling to identify malware C&C domains. Isolation Forest is based on the assumption that each instance is isolated to an external node when a tree is grown. Unfortunately, the attribute values of normal domains and malware domains are relatively close. Moreover, traditional anomaly detection algorithms ignore the different influences of different properties. In this work, we introduce differentiated information entropy to improve efficiency and utilize distance measures to detect anomalies.

          3. Overview of Our Approach

          In this section, we present an overview of the proposed approach for identifying APT malware domain, explain why we select those features that may be indicative of APT malware domain, and illustrate the metric for selecting the most normal domains.

          3.1. Architecture of Our Approach

          DNS logs are small but important. Thus, this work mainly focuses on the analysis of DNS logs in order to detect suspicious domains involved in APT malware. We store DNS logs that contain the accessing user, source IP, destination IP, country flag, domain name, request time, and response time. We then extract features from the logs and use anomaly detection to identify APT malware C&C domains. Figure 1 gives an overview of the system architecture of the proposed approach. The system consists of the following components: (1) the DNS log collector stores the DNS logs produced by mobile devices in the monitored network; (2) the multigranularity feature extractor extracts features of the domains stored in the DNS log database; (3) the normal domain identifier selects the most normal domains; (4) the anomaly learning module trains the anomaly detector using normal instances from the normal set, malware domains labeled by experts from the grey set, and APT malware C&C domains produced by the detector; (5) the anomaly detector makes decisions according to the identification results produced by the anomaly detection model.

          Figure 1: Framework of our proposed identification approach.

          The deployment of the system consists of three steps. In the first step, the features of interest are extracted. Details and motivations for the chosen features are discussed in Section 3.2. The second step defines a metric to select the normal domains used for training. The third step involves our proposed anomaly detection algorithm, which uses part of the normal samples to predict C&C domains. The proposed algorithm is described in detail in Section 4. The result is a list of the suspicious domains involved in APT malware.

          3.2. Feature Extraction

          In this work, we extracted 15 features to detect APT malware C&C domains based on mobile DNS logs. We also explain the 15 features and the reasons why they can be used to detect malicious domains. The extracted domain features are shown in Table 1.

          Table 1: Features of domain name.

          3.2.1. DNS Request and Answer-Based Features

          APT attackers usually use servers residing in different countries to build the C&C channel in order to evade detection. Moreover, attackers make use of fast flux to hide the true attack source [18]. APT attackers also change the C&C domain to point to predefined IP addresses, such as a loopback address or an invalid IP address. With this insight, we extracted four features from DNS requests and responses: the number of distinct source IP addresses, the number of distinct IP addresses with the same domain, whether the IPs are in the same country, and whether predefined IP addresses are used.
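As a rough illustration of the request and answer-based features, the sketch below counts distinct source IPs and distinct answer IPs per domain and flags answers that point to loopback, unspecified, or unparsable addresses. The record layout `(domain, source_ip, answer_ip)` and the function names are assumptions made for this example, not the paper's log schema, and the country feature is left out because it would need a GeoIP database.

```python
import ipaddress
from collections import defaultdict


def is_predefined_ip(ip_text):
    """Return True for loopback, unspecified, or unparsable addresses.

    Treating these three cases as "predefined" is an assumption for this sketch.
    """
    try:
        addr = ipaddress.ip_address(ip_text)
    except ValueError:
        return True  # not a valid IP address at all
    return addr.is_loopback or addr.is_unspecified


def request_answer_features(records):
    """Aggregate per-domain features from (domain, source_ip, answer_ip) tuples."""
    sources = defaultdict(set)
    answers = defaultdict(set)
    for domain, source_ip, answer_ip in records:
        sources[domain].add(source_ip)
        answers[domain].add(answer_ip)
    return {
        domain: {
            "distinct_source_ips": len(sources[domain]),
            "distinct_answer_ips": len(answers[domain]),
            "uses_predefined_ip": any(is_predefined_ip(ip) for ip in answers[domain]),
        }
        for domain in sources
    }
```

Sets are used so that repeated requests from the same device do not inflate the counts.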

          3.2.2. Domain-Based Features

          Attackers prefer to use long domain names to hide the suspicious part [19]. By analyzing the network traffic produced while malware communicates with command and control servers, we find that many malware C&C domains have the following characteristics: high level, long string, containing an IP address, and low visitor number. Thus, the Alexa ranking, the length of the domain, the level of the domain, and whether it contains an IP address are helpful in identifying malware domains. For example, if a domain name contains an IP address, we would conclude that it may be a malicious domain.
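The string-based features in this group can be computed directly from the domain name. The following is a minimal sketch, assuming that "level" means the number of dot-separated labels and that an embedded IP address is four groups of one to three digits separated by dots or hyphens; both definitions are illustrative choices, and the Alexa ranking is omitted because it requires an external lookup.

```python
import re

# Four groups of 1-3 digits separated by "." or "-" (assumed pattern for an embedded IPv4 address).
EMBEDDED_IPV4 = re.compile(r"(?<!\d)(?:\d{1,3}[.-]){3}\d{1,3}(?!\d)")


def domain_features(domain):
    """Return length, level, and embedded-IP flag for a domain name."""
    name = domain.rstrip(".").lower()
    labels = name.split(".")
    return {
        "length": len(name),
        "level": len(labels),
        "contains_ip": bool(EMBEDDED_IPV4.search(name)),
    }
```

For instance, `domain_features("192-168-1-10.example.net")` sets `contains_ip` to `True`, while an ordinary name such as `www.example.com` does not.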

          3.2.3. Time-Based Features

          When the connection from a compromised device to the C&C server fails, the compromised machine may send many repeated DNS requests. Sometimes, the behaviors of these infected devices show similarities. Since the IP address of a malware domain is not stored in the local server, the domain name resolution takes longer. Moreover, by analyzing the domain access records for one day in our experimental environment, we observe that few domains have a high query frequency, as illustrated in Figure 2. This phenomenon helps us to further identify malicious domain names. Thus, we extracted three features to identify APT malware C&C domains: request frequency, reaction time, and repeating pattern.

          Figure 2: Distribution of query frequency of distinct domain.
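The three time-based features can be approximated from request and response timestamps. The sketch below is one possible reading: request frequency is the request count per domain, reaction time is the mean gap between request and response, and the repeating pattern is the largest number of requests one source sends for the domain within a sliding window. The tuple layout, the 60-second window, and the function name are assumptions for illustration only.

```python
from collections import defaultdict


def time_features(records, window_seconds=60.0):
    """Compute per-domain features from (domain, source_ip, request_time, response_time).

    Times are assumed to be seconds as floats.
    """
    per_domain = defaultdict(list)
    for domain, source_ip, t_req, t_resp in records:
        per_domain[domain].append((source_ip, t_req, t_resp))

    features = {}
    for domain, rows in per_domain.items():
        latencies = [t_resp - t_req for _, t_req, t_resp in rows]

        # Group request times by source to look for bursts of repeated queries.
        by_source = defaultdict(list)
        for source_ip, t_req, _ in rows:
            by_source[source_ip].append(t_req)

        max_repeats = 0
        for times in by_source.values():
            times.sort()
            start = 0
            for end in range(len(times)):
                # Shrink the window until it spans at most window_seconds.
                while times[end] - times[start] > window_seconds:
                    start += 1
                max_repeats = max(max_repeats, end - start + 1)

        features[domain] = {
            "request_count": len(rows),
            "mean_reaction_time": sum(latencies) / len(latencies),
            "max_repeats_in_window": max_repeats,
        }
    return features
```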

          3.2.4. Whois-Based Features

          Trustworthy domains are regularly paid for several years in advance and have long lifetimes [20]. However, most malware domains live for a short period of time, less than 6 months. Moreover, the DNS record of a suspicious domain is empty or not found. Based on these observations, we use registration duration, active duration, update duration, and DNS record to detect malicious domains.

          3.3. A Metric for Normal Domain Judgement

          In order to implement anomaly detection, it is necessary to determine normal samples. An intuitive approach is to select normal domains according to the number of DNS requests initiated by internal devices. However, in order to reduce exposure risk, APT attackers do not use a malware C&C server to control too many infected machines. Moreover, in our experimental environment consisting of about 10K mobile devices, the number of domains queried by internal devices during one day follows a heavy-tailed distribution, as shown in Figure 3. About half of the domains were queried each time. Thus, we can conclude that the number of distinct access devices cannot effectively identify normal domains. By analyzing APT malware, we find that malicious domains rank outside the top 200,000 [21]. Thus, the number of visitors and the number of pages they visit are features used to identify normal domains. Furthermore, VirusTotal aggregates numerous antivirus products and online scan engines to check for malicious domains. Thus, we use the Alexa ranking and VirusTotal results to judge normal domains: a domain is considered normal if its Alexa ranking is within the top 200,000 among international domains or the top 30,000 among domestic domains, and its VirusTotal test result is less than 3.

          Figure 3: Distribution of the number of domain queries initiated by internal devices.
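The selection rule for the most normal domains can be written as a small predicate. This is a sketch only: it assumes the Alexa rank and the VirusTotal result have already been looked up, reads the VirusTotal result as a count of engines flagging the domain, and treats unranked domains as not normal. The function and parameter names are hypothetical.

```python
def is_most_normal(alexa_rank, vt_result, domestic=False):
    """Apply the Alexa/VirusTotal rule described in the text.

    alexa_rank: integer rank, or None if the domain is unranked (assumed not normal).
    vt_result:  VirusTotal test result, interpreted here as a detection count.
    domestic:   True to use the stricter domestic rank limit.
    """
    if alexa_rank is None:
        return False
    rank_limit = 30000 if domestic else 200000
    return alexa_rank < rank_limit and vt_result < 3
```

A domain ranked 25,000 with no detections passes under either limit, whereas one ranked 50,000 passes only when treated as international.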

          4. Building Anomaly Detection

          In this section, we explain our anomaly detection algorithm, called GAF.

          Definition 1 (global abnormal tree). A global abnormal tree is characterized by its center and by the number of samples it contains. A test instance, described by a multivariate feature vector, is an outlier if it has a larger distance from the center.
          Given a dataset of normal samples with multidimensional features, the global abnormal tree building process is as follows. Firstly, we select normal samples without replacement from the dataset to build the training set. Secondly, we calculate the weight of each feature by introducing differentiated information entropy. Thirdly, we select the center of the normal samples according to (1).

          An abnormal domain is identified according to the distance from the node to the center of the global abnormal tree, which can be calculated using (2). As illustrated in (3), once the mean distance of a test instance is larger than the threshold value, it is marked as a suspicious domain.

          In order to identify the weight of each feature, we need to calculate the information entropy of each feature using (4). The entropy of a feature is computed from the distinct values that the normal samples take in that dimension and from the number of normal samples that share each distinct value. Each feature then splits the training set into two parts. The information entropy difference, calculated by (5), represents the feature weight; in (5), the feature weight is normalized.

          In the process of anomaly detection based on the global outlier factor, a test instance is classified as abnormal according to its distance to the centers of the distinct global abnormal trees. In each tree, the centroid is calculated from the normal samples selected from the training set, and the weight of each feature is calculated from the normal instances in that tree. The pseudocode of the GAF algorithm is shown in Algorithm 1.
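Since the equations and Algorithm 1 are not reproduced here, the following sketch only illustrates the general shape of the procedure described above and should not be read as the authors' implementation. It assumes that each tree stores the centroid of a random subsample of normal instances, that feature weights are the normalized Shannon entropies of each feature within that subsample (a stand-in for the differentiated information entropy, whose exact formula is not given here), and that the distance is a weighted Euclidean distance averaged over all trees. The defaults of 50 trees, 200 samples, and a threshold of 0.2 follow the values reported in Section 5.1; all function names are hypothetical.

```python
import math
import random
from collections import Counter


def feature_entropy(values):
    """Shannon entropy of the distinct values taken by one feature."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def build_tree(normal_samples, sample_size, rng):
    """Build one tree: a centroid plus normalized per-feature weights (assumed form)."""
    sample = rng.sample(normal_samples, min(sample_size, len(normal_samples)))
    dims = len(sample[0])
    center = [sum(row[j] for row in sample) / len(sample) for j in range(dims)]
    entropies = [feature_entropy([row[j] for row in sample]) for j in range(dims)]
    total = sum(entropies)
    # Fall back to uniform weights when every feature is constant in the subsample.
    weights = [e / total for e in entropies] if total > 0 else [1.0 / dims] * dims
    return center, weights


def build_forest(normal_samples, n_trees=50, sample_size=200, seed=0):
    """Build several trees from independent subsamples of the normal set."""
    rng = random.Random(seed)
    return [build_tree(normal_samples, sample_size, rng) for _ in range(n_trees)]


def mean_distance(x, forest):
    """Average weighted Euclidean distance from x to each tree's center."""
    total = 0.0
    for center, weights in forest:
        total += math.sqrt(sum(w * (a - c) ** 2 for w, a, c in zip(weights, x, center)))
    return total / len(forest)


def is_suspicious(x, forest, threshold=0.2):
    """Flag x when its mean distance exceeds the threshold."""
    return mean_distance(x, forest) > threshold
```

The sketch assumes features have been scaled to comparable ranges beforehand; without scaling, a fixed threshold such as 0.2 would not be meaningful.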

          5. Experiments and Results

          In this section, we introduce the experimental setup, the performance metrics, and the obtained results.

          5.1. Experimental Setup

          In this section, we evaluate the effectiveness of our proposed approach by collecting DNS logs from a network of about 10K mobile devices for 2 weeks. This local area network holds high-value information and is therefore a likely target of APT attacks. Thus, many monitoring devices are deployed at the mobile base station to collect log records, including more than 300,000 DNS requests each day.

          Without deploying any filters, it is not possible to record this large volume of traffic. Hence, only the DNS traffic headers were stored in the log collector to extract DNS logs. The saved fields include source IP, destination IP, domain, query time, and response time. The system was implemented in Python 3.5, and all experiments were done on an off-the-shelf computer with an Intel Core i7 at 3.6 GHz and 16 GB of RAM. In order to evaluate the true positive rates and false positive rates of our anomaly detection algorithm, we ran the evaluation experiment on our training dataset, which includes part of the normal domains from the normal set and malicious domains marked by security experts.

          Threshold parameter. In our testing data, the mean distance of almost all malware domains is larger than 0.2, while the mean distance of normal domains is no larger than 0.2. Figure 4 compares the distance between the C&C domains and normal domains. The x-axis represents the different testing samples, of which the first 60 are C&C domains and the remaining 170 are normal domain names. A noticeable distinction is that the mean distance of almost all C&C domains is larger than 0.2. Meanwhile, Figure 5 illustrates the detection performance for malware C&C domains at different thresholds. Our anomaly detection algorithm achieves the lowest false positive rate and false negative rate when the threshold is set to 0.2.

          Figure 4: Difference distance between the C&C domains and normal domains.

          Figure 5: Recognition at different threshold.

          Number of trees and sample size. Using the testing data, we examined the number of trees as it increases from 10 to 90 and the number of samples as it increases from 50 to 450. The results of the experiments are presented in Figures 6 and 7. We computed the recognition rate for different numbers of trees and samples. As shown in Figure 6, when the number of trees increases from 10 to 50, the percentage of malicious domains identified increases; it is reduced when the number of trees is greater than 50, which is due to model overfitting. Figure 7 compares the effects of different numbers of samples selected by each tree. Overall, while the sample size is less than 200, the false positive rate and false negative rate decrease. Thus, the sample size used in each identification tree is set to 200 and the number of trees is set to 50 in our experimental environment.

          Figure 6: Recognition rate at different number of trees.

          Figure 7: Recognition rate at different size of samples.

          The parameters are shown in Table 2.

          Table 2: Experimental parameters settings.

          5.2. Results of Experiments and Discussion

          The detection performance for APT malware C&C domains is expressed by performance metrics that describe both the accuracy and the time requirements of the different detection algorithms. The accuracy is expressed by the following metrics: (1) False Recognition Rate; (2) Precision; (3) Recall Rate; (4) F-Measure.

          In the above equations, the four counts refer, respectively, to the number of normal domain names that are recognized as normal domain names, the number of malicious domain names that are recognized as malicious domain names, the number of malicious domain names that have been mistaken for normal domain names, and the number of normal domain names that are incorrectly identified as malicious domain names. Thus, the higher the value of Pr,
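The accuracy metrics can be computed from the four counts as in the sketch below. Precision, recall, and F-measure use their standard definitions with malicious domains as the positive class. The false recognition rate is written here as the share of all samples that were misclassified, which is an assumption, since the original formula is not reproduced. The function name and the example counts are hypothetical.

```python
def detection_metrics(tp, tn, fp, fn):
    """Compute accuracy metrics with malicious domains as the positive class.

    tp: malicious recognized as malicious
    tn: normal recognized as normal
    fp: normal incorrectly identified as malicious
    fn: malicious mistaken for normal
    """
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    # Assumed definition: fraction of all samples that were misclassified.
    false_recognition_rate = (fp + fn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "false_recognition_rate": false_recognition_rate,
    }
```

For example, `detection_metrics(tp=58, tn=168, fp=2, fn=2)` uses made-up counts that match the size of the test set described in Section 5.1 (60 C&C domains and 170 normal domains).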
