Lexical analysis

In computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens. A program or function which performs lexical analysis is called a lexical analyzer, lexer or scanner. A lexer often exists as a single function which is called by a parser or another function.

Lexical grammar

The specification of a programming language will often include a set of rules which defines the lexer. These rules usually consist of regular expressions and they define the set of possible character sequences that are used to form individual tokens or lexemes; whitespace is also defined by a regular expression and influences the recognition of other tokens, but does not itself contribute any tokens.

Token

A token is a string of characters, categorized according to the rules as a symbol (e.g. IDENTIFIER, NUMBER, COMMA, etc.). The process of forming tokens from an input stream of characters is called tokenization and the lexer categorizes them according to a symbol type. A token can look like anything that is useful for processing an input text stream or text file.

A lexical analyzer generally does nothing with combinations of tokens, a task left for a parser. For example, a typical lexical analyzer recognizes parenthesis as tokens, but does nothing to ensure that each '(' is matched with a ')'.

Consider this expression in the C programming language: :sum=3+2; Tokenized in the following table: {| class="wikitable" |+ |lexeme||token type |- |sum||Identifier |- |=||Assignment operator |- |3||Number |- | +||Addition operator |- |2||Number |- |;||End of statement |}

Tokens are frequently defined by regular expressions, which are understood by a lexical analyzer generator such as lex. The lexical analyzer (either generated automatically by a tool like lex, or hand-crafted) reads in a stream of characters, identifies the lexemes in the stream, and categorizes them into tokens. This is called "tokenizing." If the lexer finds an invalid token, it will report an error.

Following tokenizing is parsing. From there, the interpreted data may be loaded into data structures for general use, interpretation, or compiling.

Scanner

The first stage, the scanner, is usually based on a finite state machine. It has encoded within it information on the possible sequences of characters that can be contained within any of the tokens it handles (individual instances of these character sequences are known as lexemes). For instance, an integer token may contain any sequence of numerical digit characters. In many cases, the first non-whitespace character can be used to deduce the kind of token that follows and subsequent input characters are then processed one at a time until reaching a character that is not in the set of characters acceptable for that token (this is known as the maximal munch rule, or longest match rule). In some languages the lexeme creation rules are more complicated and may involve backtracking over previously read characters.

Tokenizer

Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.

Take, for example,

:The quick brown fox jumps over the lazy dog

The string isn't implicitly segmented on spaces, as an English speaker would do. The raw input, the 43 characters, must be explicitly split into the 9 tokens with a given space delimiter (i.e. matching the string " " or regular expression /\s{1}/.

The tokens could be represented in XML,

The quick brown fox jumps over the lazy dog

Or an s-expression,

(sentence ((word The) (word quick) (word brown) (word fox) (word jumps) (word over) (word the) (word lazy) (word dog)))

A lexeme, however, is only a string of characters known to be of a certain kind (e.g., a string literal, a sequence of letters). In order to construct a token, the lexical analyzer needs a second stage, the evaluator, which goes over the characters of the lexeme to produce a value. The lexeme's type combined with its value is what properly constitutes a token, which can be given to a parser. (Some tokens such as parentheses do not really have values, and so the evaluator function for these can return nothing. The evaluators for integers, identifiers, and strings can be considerably more complex. Sometimes evaluators can suppress a lexeme entirely, concealing it from the parser, which is useful for whitespace and comments.)

For example, in the source code of a computer program the string

:net_worth_future = (assets - liabilities);

might be converted (with whitespace suppressed) into the lexical token stream:

NAME "net_worth_future" EQUALS OPEN_PARENTHESIS NAME "assets" MINUS NAME "liabilities" CLOSE_PARENTHESIS SEMICOLON

Though it is possible and sometimes necessary, due to licensing restrictions of existing parsers or if the list of tokens is small, to write a lexer by hand, lexers are often generated by automated tools. These tools generally accept regular expressions that describe the tokens allowed in the input stream. Each regular expression is associated with a production in the lexical grammar of the programming language that evaluates the lexemes matching the regular expression. These tools may generate source code that can be compiled and executed or construct a state table for a finite state machine (which is plugged into template code for compilation and execution).

Regular expressions compactly represent patterns that the characters in lexemes might follow. For example, for an English-based language, a NAME token might be any English alphabetical character or an underscore, followed by any number of instances of any ASCII alphanumeric character or an underscore. This could be represented compactly by the string [a-zA-Z_][a-zA-Z_0-9]*. This means "any character a-z, A-Z or _, followed by 0 or more of a-z, A-Z, _ or 0-9".

Regular expressions and the finite state machines they generate are not powerful enough to handle recursive patterns, such as "n opening parentheses, followed by a statement, followed by n closing parentheses." They are not capable of keeping count, and verifying that n is the same on both sides — unless you have a finite set of permissible values for n. It takes a full-fledged parser to recognize such patterns in their full generality. A parser can push parentheses on a stack and then try to pop them off and see if the stack is empty at the end. (see example in the SICP book)

The Lex programming tool and its compiler is designed to generate code for fast lexical analysers based on a formal description of the lexical syntax. It is not generally considered sufficient for applications with a complicated set of lexical rules and severe performance requirements; for instance, the GNU Compiler Collection uses hand-written lexers.

Lexer generator

Lexical analysis can often be performed in a single pass if reading is done a character at a time. Single-pass lexers can be generated by tools such as the classic flex.

The lex/flex family of generators uses a table-driven approach which is much less efficient than the directly coded approach. With the latter approach the generator produces an engine that directly jumps to follow-up states via goto statements. Tools like re2c and Quex have proven (e.g. RE2C - A More Versatile Scanner Generator (1994)) to produce engines that are between two to three times faster than flex produced engines. It is in general difficult to hand-write analyzers that perform better than engines generated by these latter tools.

The simple utility of using a scanner generator should not be discounted, especially in the developmental phase, when a language specification might change daily. The ability to express lexical constructs as regular expressions facilitates the description of a lexical analyzer. Some tools offer the specification of pre- and post-conditions which are hard to program by hand. In that case, using a scanner generator may save a lot of development time.

Lexical analyzer generators

ANTLR - ANTLR generates predicated-LL(k) lexers.

Flex - Alternative variant of the classic 'lex' (C/C++).

JFlex - a rewrite of JLex.

Ragel - A state machine and lexical scanner generator with output support for C, C++, Objective-C, D, Java and Ruby source code.

The following lexical analysers can handle Unicode:

JLex - A Lexical Analyzer Generator for Java.

Quex - (or 'Queχ') A Fast Universal Lexical Analyzer Generator for C++.

References

CS 164: Programming Languages and Compilers (Class Notes #2: Lexical)

Compiling with C# and Java, Pat Terry, 2005, ISBN 0-321-26360-X 624

Algorithms + Data Structures = Programs, Niklaus Wirth, 1975, ISBN 0-13-022418-9

Compiler Construction, Niklaus Wirth, 1996, ISBN 0-201-40353-6

Sebesta, R. W. (2006). Concepts of programming languages (Seventh edition) pp. 177. Boston: Pearson/Addison-Wesley.

Word Mention Segmentation Task analysis page

On the applicability of the longest-match rule in lexical analysis

Category:Compiler theory Category:Interpreters (computing) Category:Programming language implementation

This text is licensed under the Creative Commons CC-BY-SA License. This text was originally published on Wikipedia and was developed by the Wikipedia community.

Make changes yourself !
Login or register to EDIT and SAVE any of these pages.

Email this Page

Lexical Analysis

Order: Reorder
Duration: 1:08
Published: 22 Apr 2010
Uploaded: 26 Nov 2010
Author: SuperSpig91

Brief description of how Lexical Analysis works as part of a Translator. Information taken from OCR Computing A2 Specification - Unit F453. Hope this helps :)

Intro to Lexical Analysis

Order: Reorder
Duration: 11:45
Published: 25 Jan 2011
Uploaded: 23 Feb 2011
Author: TheOresoft

Compiler Design

intro to regular expression

Order: Reorder
Duration: 13:38
Published: 30 Jan 2011
Uploaded: 08 Feb 2011
Author: TheOresoft

Compiler Design

Staring with the LEX programming, By Chandan

Order: Reorder
Duration: 2:10
Published: 22 Nov 2009
Uploaded: 20 Feb 2011
Author: cdnatutube

Starting with LEX programming...

An Introduction to C++: The Compile Time Process (Part 1 of 4)

Order: Reorder
Duration: 14:36
Published: 02 Feb 2011
Uploaded: 02 Feb 2011
Author: ravingidiot

An introduction to the C++ programming language. In this episode, aspects of the C++ compiler are discussed, and a simple C++ program is written. This video is rather long, and as a result, I split it up into 4 parts, so be sure to watch everything in sequence! Sorry for the delay on this video. I had to rerecord it a few times due to techincal difficulties, and while I'm not entirely happy with the result, I'm using my last take, so I'll try and correct any errors I come across.

An Introduction to C++: The Compile Time Process (Part 2 of 4)

Order: Reorder
Duration: 14:34
Published: 02 Feb 2011
Uploaded: 02 Feb 2011
Author: ravingidiot

An Introduction to C++: The Compile Time Process (Part 3 of 4)

Order: Reorder
Duration: 14:58
Published: 02 Feb 2011
Uploaded: 14 Feb 2011
Author: ravingidiot

An introduction to the C++ programming language. In this episode, aspects of the C++ compiler are discussed, and a simple C++ program is written. This video is rather long, and as a result, I split it up into 4 parts, so be sure to watch everything in sequence! Errata: 5:15 - I state that function parameters are evaluated left to right, but this is not so according to C++98. (or, as far as I am aware, any further revision of ISO 14482) The order in which parameters are evaluated is implementation dependent.

An Introduction to C++: The Compile Time Process (Part 4 of 4)

Order: Reorder
Duration: 10:06
Published: 02 Feb 2011
Uploaded: 02 Feb 2011
Author: ravingidiot

Autonomous Speaker Agent

Order: Reorder
Duration: 0:48
Published: 10 Sep 2010
Uploaded: 10 Sep 2010
Author: HOTLabFER

Autonomous Speaker Agent is a graphically embodied animated agent (a virtual character) capable of reading plain English text and rendering it in a form of speech, accompanied by the appropriate, natural-looking facial gestures. The system uses lexical analysis and statistical models of facial gestures in order to generate the gestures related to the spoken text.

Building Japanese FrameNet: A lexical resource based on cognitive knowledge

Order: Reorder
Duration: 4:11
Published: 01 Apr 2010
Uploaded: 26 Aug 2010
Author: keiouniversity

What is our knowledge of language all about? As language has diversified so much in recent years, how much can computers understand human language? With this question in mind, the Ohara Laboratory is doing research on natural languages. The research is based on the idea that computers can be a starting point for looking at human language. That is, the problems that computers have in processing language reflect the key features of human ability to process it. So by looking at problems of language processing by computers, its true nature will become clear. Q. "When you look up a dictionary, what you find is definitions such as left is the opposite of right, and east is the opposite of west. But to understand the meaning of each word involves not just knowing purely linguistic meaning of the word like that, but also having encyclopedic knowledge of it. So in our descriptions of the meaning of words, we want to incorporate such encyclopedic knowledge, including common sense and scientific knowledge, which is not usually found in dictionaries. Thats the aim of our project." In kana to kanji conversion by computers, the accuracy has currently improved to over 90%. But getting a computer to understand the meaning of words, which changes subtly depending on the situation, is a problem thats yet to be solved. For example, the Japanese phrase kurumadematsu can mean wait in the car or wait until someone comes. A person can judge which of the two is meant from the words that precede <b>...</b>

Derivations In Syntax Analysis

Order: Reorder
Duration: 12:16
Published: 08 Feb 2011
Uploaded: 08 Feb 2011
Author: TheOresoft

Compiler Design

Intego Personal Antispam: Keeps your Mac Inbox Spam-Free

Order: Reorder
Duration: 9:27
Published: 07 Jul 2008
Uploaded: 23 Nov 2010
Author: IntegoVideo

Intego Personal Antispam is the ultimate spam-fighting tool for Mac OS X. Personal Antispam analyzes your incoming e-mail, determining which messages are spam and which are valid messages, and does so in many ways: by checking a whitelist and blacklist of addresses, message headers, content or layout, URLs in messages, and much more. Personal Antispam uses a set of plug-ins to provide the most powerful spam filter available. You choose which plug-ins you want to use, according to the type of spam you receive. Personal Antispam uses lexical filtering (also known as Bayesian analysis), attachment filters, whitelist and blacklist filters, and also filters URLs, using a database of known spam URLs that are often included in messages. Personal Antispam also looks for specific types of headers and layout used in HTML messages, which are most often used for spam, allowing the program to not only search for content, but also for the subtle ways that spammers craft their messages. After you have had Personal Antispam learn from your e-mail, you can export your settings and spam database, and give it to friends, family or colleagues, so these people can get optimal spam filtering immediately, without having to wait for Personal Antispam to learn from their e-mail. They can then have the program learn from the more specific types of spam and legitimate e-mail they receive to refine the program's efficiency. Spam techniques are constantly changing, and Personal Antispam offers <b>...</b>

Ambiguous Grammar Example

Order: Reorder
Duration: 6:18
Published: 05 Aug 2010
Uploaded: 20 Feb 2011
Author: abhiscar

A simple example of an ambiguous grammar( Theory Of Computation || Compiler Design) For Feedback & Queries visit http:\\abhilash-marichi.webnode.com

How the compiler works

Order: Reorder
Duration: 1:30
Published: 18 Aug 2010
Uploaded: 17 Feb 2011
Author: eurochild

Quick presentation about the compiler stages.

NZOSA 2010 - Catalyst Lifetime Achievement in Open Source Award - Project R

Order: Reorder
Duration: 3:32
Published: 16 Nov 2010
Uploaded: 23 Nov 2010
Author: NZOSA

In computing, R is a programming language and software environment for statistical computing and graphics. R is an implementation of the S programming language created by John Chambers while at Bell Labs, combined with lexical scoping semantics inspired by Scheme. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is now developed by the R Development Core Team. The R language has become a de facto standard among statisticians for the development of statistical software, and is widely used for statistical software development and data analysis. Mike O'Connor announces the award and presents it to Ross Ihaka. More information about the awards at www.nzosa.org.nz This recording is licensed under Creative Commons BY-SA 3.0 NZ creativecommons.org

Romans 5:12 vs. Refuting Calvinism (3of6): The Tree of Life and Its Purpose

Order: Reorder
Duration: 7:48
Published: 27 Aug 2009
Uploaded: 30 Jul 2010
Author: DefendingCalvinism

In today's Church we are seeing the rise of the popularity of the doctrines of the 4th Century heretic Pelagius: the denial of Original Sin, the non-essential nature of grace, Jesus is just a moral guide, etc. In the 3rd video of this series, I conclude my analysis of what the final, Restored/Renewed state meant to the Apostle Paul and show how this conforms with the Genesis account of the Fall. Also, I begin to investigate what the Tree of Life's purpose was and is. Common objections are given full consideration. God Bless! This series only addresses the responses of Kerrigan Skelly and those at Refuting Calvinism which they have given to this one verse, since in order to soundly interpret the rest of the cited passage, one has to first rightly understand its introductory statement in verse 12. I hope this series is helpful to you!! NOTE: All cited passages are taken from the UBS: Greek New Testament, 4th Revised Reader's Edition (translations provided by myself); my Lexical source is A Greek Lexicon of the New Testament and Other Early Christian Literature (Bauer, Arndt, Gingrich) 4th Revised and Augmented Edition, 1952. Continue watching the rest of the series at the following link: www.youtube.com OR www.youtube.com If you liked this video and would like to see more that defend the Faith, please visit my primary channel Paul1T2Day's VLog at www.youtube.com/paul1t2day.

Stages of a compiler constituting the backend.

Order: Reorder
Duration: 13:49
Published: 17 Jan 2011
Uploaded: 11 Feb 2011
Author: TheOresoft

Compiler Design

What is a compiler

Order: Reorder
Duration: 9:47
Published: 15 Jan 2011
Uploaded: 20 Feb 2011
Author: TheOresoft

what is a compiler and why we need that.

Second Stage of Compiler

Order: Reorder
Duration: 8:51
Published: 17 Jan 2011
Uploaded: 17 Jan 2011
Author: TheOresoft

Compiler Design

Parallel Processing

Order: Reorder
Duration: 2:08
Published: 07 May 2010
Uploaded: 19 Aug 2010
Author: SuperSpig91

Brief description of how Parallel Processing works. Information taken from OCR Computing A2 Specification - Unit F453.

Newspeak: A Principled Dynamic Language

Order: Reorder
Duration: 86:39
Published: 16 Jun 2010
Uploaded: 06 Dec 2010
Author: GoogleTechTalks

Google Tech Talk May 4, 2010 ABSTRACT In this talk, we present the main features of Newspeak, a dynamic programming language focused on software engineering. All names in Newspeak are late bound - including class names. Hence all classes in Newspeak are virtual, every class declaration defines a mixin, and class hierarchy inheritance comes for free. Newspeak has no global namespace and no static state. Top level classes act as module definitions. These have no external dependencies. Each instance of a top level class is a module that runs in its own sandbox in accordance with the object capability security model. Gilad Bracha is the creator of the Newspeak programming language. Previously, he was a Distinguished Engineer at Cadence, and a Computational Theologist and Distinguished Engineer at Sun. He is co-author of the Java Language Specification, and a researcher in the area of object-oriented programming languages. Prior to joining Sun, he worked on Strongtalk, the Animorphic Smalltalk System. He received his B.Sc in Mathematics and Computer Science from Ben Gurion University in Israel and a Ph.D. in Computer Science from the University of Utah.

YYCCC 2010-12-06 Calgary City Council - December 6, 2010

Order: Reorder
Duration: 417:41
Published: 08 Dec 2010
Uploaded: 25 Feb 2011
Author: gordonmcdowell

Here is new process: gordonmcdowell.com Dec 6 captions are generated by YouTube's Machine Transcription, so are not very accurate, but still handy for searching for key words using Interactive Transcript feature.

Play next
List all Videos
Autoplay
Autorepeat
Shuffle

clear restore
list images
close sort

Lexical Analysis