The apostrophe ( ’ although often rendered as ' ) is a punctuation mark, and sometimes a diacritic mark, in languages that use the Latin alphabet or certain other alphabets. In English, it serves three purposes:[1]
According to the Oxford English Dictionary (OED), the word comes ultimately from Greek ἡ ἀπόστροφος [προσῳδία] (hē apóstrophos [prosōidía], "[the accent of] 'turning away', or elision"), through Latin and French.[2]
The apostrophe usually looks the same as a closing single quotation mark, although they have different meanings. The apostrophe also looks similar to the prime symbol ( ′ ), which is used to indicate measurement in feet or arcminutes, as well as for various mathematical purposes, and the ʻokina ( ʻ ), which represents a glottal stop in Polynesian languages.
The apostrophe was introduced into English in the 16th century in imitation of French practice.[3]
Introduced by Geoffroy Tory (1530), the apostrophe was used in place of a vowel letter to indicate elision (as in l'heure in place of la heure). It was frequently used in place of letter e when no actual vowel sound was elided (as in un' heure). Modern French orthography has restored the spelling une heure.[4]
From the 16th century, following French practice, the apostrophe was used when a vowel letter was omitted either because of incidental elision (I'm for I am) or because the letter no longer represented a sound (lov'd for loved). English spelling retained many inflections that were not pronounced as syllables, notably verb endings (-est, -eth, -es, -ed) and the noun ending -es, which marked either plurals or possessives (also known as genitives; see Possessive apostrophe, below). So apostrophe followed by s was often used to mark a plural, especially when the noun was a loan word (and especially a word ending in a, as in the two comma's).[3]
The use of elision has continued to the present day, but significant changes have been made to the possessive and plural uses. By the 18th century, apostrophe + s was regularly used for all possessive singular forms, even when the letter e was not omitted (as in the gate's height). This was regarded as representing the Old English genitive singular inflection -es. The plural use was greatly reduced, but a need was felt to mark possessive plural. The solution was to use an apostrophe after the plural s (as in girls' dresses). However, this was not universally accepted until the mid-19th century.[3]
The apostrophe is used to indicate possession. This convention distinguishes possessive singular forms (Bernadette's, flower's, glass's, one's) from simple plural forms (Bernadettes, flowers, glasses, ones), and both of those from possessive plural forms (Bernadettes', flowers', glasses', ones'). For singulars, the modern possessive or genitive inflection is a survival from certain genitive inflections in Old English, and the apostrophe originally marked the loss of the old e (for example, lambes became lamb's).
For most singular nouns the ending 's is added; e.g., the cat's whiskers.
When the noun is a normal plural, with an added s, no extra s is added in the possessive; so pens' caps (where there is more than one pen) is correct rather than pens's caps.
Compound nouns have their singular possessives formed with an apostrophe and an added s, in accordance with the rules given above: the Attorney-General's husband; the Lord Warden of the Cinque Ports' prerogative; this Minister for Justice's intervention; her father-in-law's new wife.
A distinction is made between joint possession (Jason and Sue's e-mails: the e-mails of both Jason and Sue), and separate possession (Jason's and Sue's e-mails: both the e-mails of Jason and the e-mails of Sue). Style guides differ only in how much detail they provide concerning these.[9] Their consensus is that if possession is joint, only the last possessor has possessive inflection; in separate possession all the possessors have possessive inflection. If, however, any of the possessors is indicated by a pronoun, then for both joint and separate possession all of the possessors have possessive inflection (his and her e-mails; his, her, and Anthea's e-mails; Jason's and her e-mails; his and Sue's e-mails; his and Sue's wedding; his and Sue's weddings).
Note that in cases of joint possession, the above rule does not distinguish between a situation in which only one or more jointly possessed items perform a grammatical role and a situation in which both one or more such items and a non-possessing entity independently perform that role. Although verb number suffices in some cases ("Jason and Sue's dog has porphyria") and context suffices in others ("Jason and Sue's e-mails rarely exceed 200 characters in length"), number and grammatical position often prevent a resolution of ambiguity:
If the word or compound includes, or even ends with, a punctuation mark, an apostrophe and an s are still added in the usual way: "Westward Ho!'s railway station;" Awaye!'s Paulette Whitten recorded Bob Wilson's story;[10] Washington, D.C.'s museums,[11] assuming that the prevailing style requires full stops in D.C.
An apostrophe is used in time and money references, among others, in constructions such as one hour's respite, two weeks' holiday, a dollar's worth, five pounds' worth, one mile's drive from here. This is like an ordinary possessive use. For example, one hour's respite means a respite of one hour (exactly as the cat's whiskers means the whiskers of the cat). Exceptions are accounted for in the same way: three months pregnant (in modern usage, we do not say pregnant of three months, nor one month(')s pregnant).
No apostrophe is used in the following possessive pronouns and adjectives: yours, his, hers, ours, its, theirs, and whose.
The possessive of it was originally it's, and many people continue to write it this way, though the apostrophe was dropped in the early 1800s and authorities are now unanimous that it's can be only a contraction of it is or it has.[14][15] For example, US President Thomas Jefferson used it's as a possessive in his instructions dated 20 June 1803 to Lewis for his preparations for his great expedition.[16]
All other possessive pronouns ending in s do take an apostrophe: one's; everyone's; somebody's, nobody else's, etc. With plural forms, the apostrophe follows the s, as with nouns: the others' husbands (but compare They all looked at each other's husbands, in which both each and other are singular).
Each of these four phrases (listed in Steven Pinker's The Language Instinct) has a distinct meaning:
Kingsley Amis, on being challenged to produce a sentence whose meaning depended on a possessive apostrophe, came up with:
This subsection deals with singular nouns pronounced with a sibilant sound at the end: /s/ or /z/. The spelling of these ends with -s, -se, -z, -ze, -ce, -x, or -xe.
Many respected authorities recommend that practically all singular nouns, including those ending with a sibilant sound, have possessive forms with an extra s after the apostrophe so that the spelling reflects the underlying pronunciation. Examples include Oxford University Press, the Modern Language Association, the BBC and The Economist.[18] Such authorities demand possessive singulars like these: Senator Jones's umbrella; Tony Adams's friend. Rules that modify or extend the standard principle have included the following:
However, some contemporary writers still follow the older practice of omitting the extra s in all cases ending with a sibilant, but usually not when written -x or -xe.[24] Some contemporary authorities such as the Associated Press Stylebook[25] and The Chicago Manual of Style recommend or allow the practice of omitting the extra "s" in all words ending with an "s", but not in words ending with other sibilants ("z" and "x").[26] The 15th edition of The Chicago Manual of Style still recommended the traditional practice, which included providing for several exceptions to accommodate spoken usage such as the omission of the extra s after a polysyllabic word ending in a sibilant. The 16th edition of CMOS no longer recommends omitting the extra "s".[27]
Similar examples of notable names ending in an s that are often given a possessive apostrophe with no additional s include Dickens and Williams. There is often a policy of leaving off the additional s on any such name, but this can prove problematic when specific names are contradictory (for example, St James' Park in Newcastle [the football ground] and the area of St. James's Park in London). For more details on practice with geographic names, see the relevant section below.
Some writers like to reflect standard spoken practice in cases like these with sake: for convenience' sake, for goodness' sake, for appearance' sake, for compromise' sake, etc. This punctuation is preferred in major style guides. Others prefer to add 's: for convenience's sake.[28] Still others prefer to omit the apostrophe when there is an s sound before sake: for morality's sake, but for convenience sake.[29]
The Supreme Court of the United States is split on whether a possessive singular noun that ends with s should always have an additional s after the apostrophe, sometimes have an additional s after the apostrophe (for instance, based on whether the final sound of the original word is pronounced /s/ or /z/), or never have an additional s after the apostrophe. The informal majority view (5–4, based on past writings of the justices) has favoured the additional s, but a strong minority disagrees.[30]
The English possessive of French nouns ending in a silent s, x, or z is rendered differently by different authorities. Some people prefer Descartes' and Dumas', while others insist on Descartes's and Dumas's.[citation needed] Certainly a sibilant is pronounced in these cases; the theoretical question is whether the existing final letter is sounded or whether s needs to be added.[citation needed] Similar examples with x or z: Sauce Périgueux's main ingredient is truffle; His pince-nez's loss went unnoticed; "Verreaux('s) eagle, a large, predominantly black eagle, Aquila verreauxi,..." (OED, entry for "Verreaux", with silent x; see Verreaux's eagle); in each of these some writers might omit the added s. The same principles and residual uncertainties apply with "naturalised" English words, like Illinois and Arkansas.[31]
For possessive plurals of words ending in silent x, z or s, the few authorities that address the issue at all typically call for an added s and require that the apostrophe precede the s: The Loucheux's homeland is in the Yukon; Compare the two Dumas's literary achievements.[32] The possessive of a cited French title with a silent plural ending is uncertain: "Trois femmes's long and complicated publication history",[33] but "Les noces' singular effect was 'exotic primitive'..." (with nearby sibilants -ce- in noces and s- in singular).[34] Compare treatment of other titles, above.
Guides typically seek a principle that will yield uniformity, even for foreign words that fit awkwardly with standard English punctuation.
Place names in the United States do not use the possessive apostrophe on federal maps and signs.[35] The United States Board on Geographic Names, which has responsibility for formal naming of municipalities and geographic features, has deprecated the use of possessive apostrophes since 1890 so as not to show ownership of the place.[35][36] Only five names of natural features in the U.S. are officially spelled with a genitive apostrophe (one example being Martha's Vineyard).[36][37]
On the other hand, the United Kingdom has Bishop's Stortford, Bishop's Castle and King's Lynn (but St Albans, St Andrews and St Helens possibly because their names date to before the use was formalised[citation needed]) and, while Newcastle United play at a stadium previously called St James' Park, and Exeter City at St James Park, London has a St James's Park (this whole area of London is named after St James's Church, Piccadilly[38]). The special circumstances of the latter case may be this: the customary pronunciation of this place name is reflected in the addition of an extra -s; since usage is firmly against a doubling of the final -s without an apostrophe, this place name has an apostrophe. This could be regarded by some people as an example of a double genitive: it refers to the park of the church of St James.
Omission of the apostrophe in geographical names is becoming standard in some English-speaking countries, including Australia.[39] Modern usage has been influenced by considerations of technological convenience including the economy of typewriter ribbons and films, and similar computer character "disallowance" which tend to ignore traditional canons of correctness.[40] Practice in the United Kingdom and Canada is not so uniform.[41]
Sometimes the apostrophe is omitted in the names of clubs, societies, and other organizations, even though the standard principles seem to require it: Country Women's Association, but International Aviation Womens Association;[42] Magistrates' Court of Victoria,[43] but Federated Ship Painters and Dockers Union. Usage is variable and inconsistent. Style guides typically advise consulting an official source for the standard form of the name; some tend towards greater prescriptiveness, for or against such an apostrophe.[44] As the case of womens shows, it is not possible to analyze these forms simply as non-possessive plurals, since women is the only correct plural form of woman.
Where a business name is based on a family name it should take an apostrophe, but many leave it out (contrast Sainsbury's with Harrods). In recent times there has been an increasing tendency to drop the apostrophe. Names based on a first name are more likely to take an apostrophe (Joe's Crab Shack). Some business names may inadvertently spell a different name if the name with an s at the end is also a name, such as Parson. A small activist group called the Apostrophe Protection Society[45] has campaigned for large retailers such as Harrods, Currys, and Selfridges to reinstate their missing punctuation. A spokesperson for Barclays PLC stated, "It has just disappeared over the years. Barclays is no longer associated with the family name."[46] Further confusion can be caused by businesses whose names tend to look like they are pronounced differently without an apostrophe such as Paulos Circus, and other companies that leave the apostrophe out of their logos but include it in written text, such as Waterstone's and Cadwalader's.
An apostrophe is commonly used to indicate omitted characters, normally letters:
An apostrophe is used by some writers to form a plural for abbreviations, acronyms, and symbols where adding just s rather than 's may leave things ambiguous or inelegant. Some specific cases:
Names that are not strictly native to English sometimes have an apostrophe substituted to represent other characters (see also As a mark of elision, below).
In transliterated foreign words, an apostrophe may be used to separate letters or syllables that otherwise would likely be interpreted incorrectly. For example:
Furthermore, an apostrophe may be used to indicate a glottal stop in transliterations. For example:
Rather than ʿ the apostrophe is sometimes used to indicate a voiced pharyngeal fricative as it sounds and looks like the glottal stop to most English speakers. For example:
Failure to observe standard use of the apostrophe is widespread and frequently criticised as incorrect,[52][53] often generating heated debate. The British founder of the Apostrophe Protection Society earned a 2001 Ig Nobel prize for "efforts to protect, promote and defend the differences between plural and possessive".[54] A 2004 report by OCR, a British examination board, stated that "the inaccurate use of the apostrophe is so widespread as to be almost universal".[55] A 2008 survey found that nearly half of the UK adults polled were unable to use the apostrophe correctly.[53]
Apostrophes used in a non-standard manner to form noun plurals are known as greengrocers' apostrophes or grocers' apostrophes, often called (spelled) greengrocer's apostrophes[56] and grocer's apostrophes.[57] They are sometimes humorously called greengrocers apostrophe's, rogue apostrophes, or idiot's apostrophes (a literal translation of the German word Deppenapostroph, which criticises the misapplication of apostrophes in Denglisch). The practice, once common and acceptable (see Historical development), comes from the identical sound of the plural and possessive forms of most English nouns. It is often criticised as a form of hypercorrection coming from a widespread ignorance of the proper use of the apostrophe or of punctuation in general. Lynne Truss, author of Eats, Shoots & Leaves, points out that before the 19th century, it was standard orthography to use the apostrophe to form a plural of a foreign-sounding word that ended in a vowel (e.g., banana's, folio's, logo's, quarto's, pasta's, ouzo's) to clarify pronunciation. Truss says this usage is no longer considered proper in formal writing.[58]
The term is believed to have been coined in the middle of the 20th century by a teacher of languages working in Liverpool, at a time when such mistakes were common in the handwritten signs and advertisements of greengrocers (e.g., Apple's 1/- a pound, Orange's 1/6d a pound). Some have argued that its use in mass communication by employees of well-known companies has led to the less literate assuming it to be correct and adopting the habit themselves.[59]
The same use of the apostrophe before noun plural -s forms is sometimes made by non-native speakers of English. For example, in Dutch, the apostrophe is inserted before the s when pluralising most words ending in a vowel or y; for example, baby's (English babies) and radio's (English radios). This often produces so-called "Dunglish" errors when carried over into English.[60] Hyperforeignism has been formalised in some pseudo-anglicisms. For example, the French word pin's (from English pin) is used (with the apostrophe in both singular and plural) for collectable lapel pins. Similarly, there is an Andorran football club called FC Rànger's (after such British clubs as Rangers F.C.), a Japanese dance group called Super Monkey's, and a Japanese pop punk band called the Titan Go King's.[61]
The widespread use of apostrophes before the s of plural nouns has led to the incorrect belief that an apostrophe is also needed before the s of the third-person present tense of a verb. Thus, he take's, it begin's, etc.[citation needed]
There is a tendency to drop apostrophes in many commonly used names such as St Annes, St Johns Lane,[62] and so on.
In 2009, a resident in Royal Tunbridge Wells was accused of vandalism after he painted apostrophes on road signs that had spelt St John's Close as St Johns Close.[63]
UK supermarket chain Tesco omits the mark where standard practice would require it. Signs in Tesco advertise (among other items) "mens magazines", "girls toys", "kids books" and "womens shoes". In his book Troublesome Words, author Bill Bryson lambasts Tesco for this, stating that "the mistake is inexcusable, and those who make it are linguistic Neanderthals."[64]
George Bernard Shaw, a proponent of English spelling reform on phonetic principles, argued that the apostrophe was mostly redundant. He did not use it for spelling cant, hes, etc. in many of his writings. He did however allow I'm and it's.[65] Hubert Selby, Jr. used a slash instead of an apostrophe mark for contractions and did not use an apostrophe at all for possessives. Lewis Carroll made greater use of apostrophes, and frequently used sha'n't, with an apostrophe in place of the elided "ll" as well as the more usual "o".[66][citation needed] These authors' usages have not become widespread.
The British pop group Hear'Say famously made unconventional use of an apostrophe in its name. Truss comments that "the naming of Hear'Say in 2001 was [...] a significant milestone on the road to punctuation anarchy".[67] Dexys Midnight Runners, on the other hand, omit the apostrophe (though "dexys" can be understood as a plural form of "dexy", rather than a possessive form).
An apostrophe wrongly thought to be misused in popular culture occurs in the name of Liverpudlian rock band The La's. This apostrophe is often thought to be a mistake; but in fact it marks omission of the letter d. The name comes from the Scouse slang for "The Lads".
Over the years, the use of apostrophes has been criticized. George Bernard Shaw called them "uncouth bacilli". In his book, American Speech, linguist Steven Byington stated of the apostrophe that "the language would be none the worse for its abolition." Adrian Room in his English Journal article "Axing the Apostrophe" argued that apostrophes are unnecessary and context will resolve any ambiguity.[68] In a letter to the English Journal, Peter Brodie stated that apostrophes are "largely decorative...[and] rarely clarify meaning".[69] Dr. John C. Wells, Emeritus Professor of Phonetics at University College London, says the apostrophe is "a waste of time". Peter Buck, guitarist of R.E.M., claimed "We all hate apostrophes. There's never been a good rock album that's had an apostrophe in the title".[68]
In many languages, especially European languages, the apostrophe is used to indicate the elision of one or more sounds, as in English.
Other languages and transliteration systems use the apostrophe or some similar mark to indicate a glottal stop, sometimes considering it a letter of the alphabet:
The apostrophe represents sounds resembling the glottal stop in the Turkic languages and in some romanizations of Semitic languages, including Arabic. In typography, this function may be performed by the closing single quotation mark. In that case, the Arabic letter ‘ayn (ع) is correspondingly transliterated with the opening single quotation mark.
Some languages and transliteration systems use the apostrophe to mark the presence, or the lack of, palatalization.
Some languages use the apostrophe to separate the root of a word and its affixes, especially if the root is foreign and unassimilated. (For another kind of morphemic separation see pinyin, below.)
The form of the apostrophe originates in manuscript writing, as a point with a downwards tail curving clockwise. This form was inherited by the typographic apostrophe ( ’ ), also known as the typeset apostrophe, or, informally, the curly apostrophe. Later sans-serif typefaces had stylized apostrophes with a more geometric or simplified form, but usually retaining the same directional bias as a closing quotation mark.
With the invention of the typewriter, a "neutral" quotation mark form ( ' ) was created to economize on the keyboard: a single key represented the apostrophe, both opening and closing single quotation marks, single primes, and, on some typewriters, the exclamation point (by overprinting with a period). This is known as the typewriter apostrophe or vertical apostrophe. The same convention was adopted for quotation marks.
Both simplifications carried over to computer keyboards and the ASCII character set. However, although these are widely used due to their ubiquity and convenience, they are deprecated in contexts where proper typography is important.[79]
The typewriter apostrophe ( ' ) was inherited by computer keyboards, and is the only apostrophe character available in the (7-bit) ASCII character encoding, at code value 0x27 (39). As such, it is a highly overloaded character. In ASCII, it represents a right single quotation mark, left single quotation mark, apostrophe, vertical line or prime (punctuation marks), or an acute accent (modifier letters).
Many earlier (pre-1985) computer displays and printers rendered the ASCII apostrophe as a typographic apostrophe, and rendered the ASCII grave accent ( ` ) U+0060 as a matching left single quotation mark. This allowed a more typographic appearance of text: ``I can't'' would appear as ‘‘I can’t’’ on these systems. This can still be seen in many documents prepared at that time, and is still used in the TeX typesetting system to create typographic quotes.
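The rendering convention described above amounts to a two-character substitution. The following is a minimal sketch in Python; the function name render_legacy_quotes is hypothetical and only illustrates the mapping, not the behaviour of any particular display or printer.

```python
def render_legacy_quotes(text: str) -> str:
    """Map the ASCII grave accent and apostrophe to curly single quotes."""
    mapping = {
        ord("`"): "\u2018",  # grave accent -> left single quotation mark
        ord("'"): "\u2019",  # apostrophe -> right single quotation mark
    }
    return text.translate(mapping)


print(render_legacy_quotes("``I can't''"))
```

Because every apostrophe is mapped the same way, the apostrophe inside can't and the two closing marks all become the same right-hand glyph.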
Support for the typographic apostrophe ( ’ ) was introduced in a variety of 8-bit character encodings, such as the Apple Macintosh operating system's Mac Roman character set (in 1984), and later in the CP1252 encoding of Microsoft Windows. There is no such character in ISO-8859-1.
Microsoft Windows CP1252 (sometimes incorrectly called ANSI or ISO-Latin) contains the typographic apostrophe at 0x92. Due to "smart quotes" in Microsoft software converting the ASCII apostrophe to this value, other software makers have been forced to adopt this as a de facto convention. For instance the HTML 5 standard specifies that this value is interpreted as CP1252. Some earlier non-Microsoft browsers would display a '?' for this and make web pages composed with Microsoft software somewhat hard to read.
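The code values mentioned above can be checked with Python's built-in codecs. This is an illustrative sketch only; the helper can_encode is a hypothetical name introduced for the example.

```python
ASCII_APOSTROPHE = "'"
TYPOGRAPHIC_APOSTROPHE = "\u2019"


def can_encode(ch: str, encoding: str) -> bool:
    """Return True if the character exists in the given encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# The typewriter apostrophe is ASCII code value 0x27 (decimal 39).
print(hex(ord(ASCII_APOSTROPHE)))

# CP1252 places the typographic apostrophe at byte 0x92.
print(TYPOGRAPHIC_APOSTROPHE.encode("cp1252"))

# ISO-8859-1 has no typographic apostrophe, so encoding fails there.
print(can_encode(TYPOGRAPHIC_APOSTROPHE, "latin-1"))

# The same byte decodes differently depending on the assumed encoding:
# CP1252 yields U+2019, ISO-8859-1 yields the control code U+0092.
print(repr(b"\x92".decode("cp1252")), repr(b"\x92".decode("latin-1")))
```

The last line shows why a page declared as ISO-8859-1 but written with CP1252 bytes displays correctly only if the reader treats it as CP1252.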
There are several types of apostrophe character in Unicode:
Although ubiquitous in typeset material, the typographic apostrophe ( ’ ) is rather difficult to enter on a computer, since it does not have its own key on a standard keyboard. Outside the world of professional typesetting and graphic design, many people do not know how to enter this character and instead use the typewriter apostrophe ( ' ). The typewriter apostrophe has always been considered tolerable on Web pages because of the egalitarian nature of Web publishing and the low resolution of computer monitors in comparison to print.
More recently, the correct use of the typographic apostrophe is becoming more common on the Web due to the wide adoption of the Unicode text encoding standard, higher-resolution displays, and advanced anti-aliasing of text in modern operating systems. Because typewriter apostrophes are now often automatically converted to typographic apostrophes by wordprocessing and desktop-publishing software (see below), the typographic apostrophe does often appear in documents produced by non-professionals.
Unicode | Decimal | Macintosh | Windows-1252 Alt code | Linux/X | HTML entity |
---|---|---|---|---|---|
U+2019 | 8217 | Option + Shift + ] | Alt + 0146 on number pad | AltGr + Shift + B or Compose ' > | &rsquo; |
XML (and hence XHTML) defines an &apos; character entity reference for the ASCII typewriter apostrophe. No equivalent entity is defined in the HTML 4 standard,[82] despite all the other predefined character entities from XML being defined in HTML. If it cannot be entered literally in HTML, a numeric character reference could be used instead, such as "&#39;" or "&#x27;".
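As a rough illustration, Python's standard html module can resolve the named reference &apos; and the numeric references &#39; and &#x27;. Note that html.unescape follows HTML 5 rules, which do recognise &apos;, so this sketch does not reproduce the HTML 4 limitation described above.

```python
import html

# Named and numeric references that all resolve to the ASCII apostrophe.
named = html.unescape("&apos;")
decimal = html.unescape("&#39;")
hexadecimal = html.unescape("&#x27;")

# The typographic apostrophe has its own named entity.
curly = html.unescape("&rsquo;")

# html.escape (quote=True by default) emits the hexadecimal numeric form
# rather than a named entity.
escaped = html.escape("it's")

print(named, decimal, hexadecimal, curly, escaped)
```

Emitting a numeric reference is the portable choice, since it does not depend on which named entities a given parser knows.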
To make typographic apostrophes easier to enter, wordprocessing and publishing software often converts typewriter apostrophes to typographic apostrophes during text entry (at the same time converting opening and closing single and double quotes to their correct left-handed or right-handed forms). A similar facility may be offered on web servers after submitting text in a form field, e.g. on weblogs or free encyclopedias. This is known as the smart quotes feature; apostrophes and quotation marks that are not automatically altered by computer programs are known as dumb quotes.
Such conversion is not always done in accordance with the standards for character sets and encodings. Additionally, many such software programs incorrectly convert a leading apostrophe to an opening quotation mark (e.g., in abbreviations of years: ‘29 rather than the correct ’29 for the years 1929 or 2029, depending on context; or ‘twas instead of ’twas as the archaic abbreviation of it was). Smart quote features also often fail to recognise situations when a prime rather than an apostrophe is needed; for example, incorrectly rendering the latitude 49° 53′ 08″ as 49° 53’ 08”.
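The leading-apostrophe problem follows from the simple rule that many converters appear to apply. The sketch below is a deliberately naive converter, written in Python under the assumption that an apostrophe at the start of the text, or after whitespace or an opening bracket, is an opening quotation mark. The function name naive_smart_quotes is hypothetical, and this is not the algorithm of any specific product.

```python
import re

LEFT_SINGLE = "\u2018"
RIGHT_SINGLE = "\u2019"


def naive_smart_quotes(text: str) -> str:
    """Convert typewriter apostrophes using a position-only heuristic."""
    # An apostrophe not preceded by a word-like character is assumed to open a quote.
    text = re.sub(r"(?<![^\s(\[{])'", LEFT_SINGLE, text)
    # Every remaining apostrophe is treated as closing quote or apostrophe.
    return text.replace("'", RIGHT_SINGLE)


print(naive_smart_quotes("'quoted' and don't"))  # handled as intended
print(naive_smart_quotes("'twas the crash of '29"))  # leading apostrophes go wrong
```

The heuristic cannot tell an elision at the start of a word from an opening quotation mark, because the distinction depends on meaning rather than position.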
In Microsoft Word it is possible to turn smart quotes off (in some versions, by navigating through Tools, AutoCorrect, AutoFormat as you type, and then unchecking the appropriate option). Alternatively, typing Control-Z (for Undo) immediately after entering the apostrophe will convert it back to a typewriter apostrophe. In Microsoft Word for Windows, holding down the Control key while typing two apostrophes will produce a single typographic apostrophe.
Some programming languages, like Pascal, use the ASCII apostrophe to delimit string constants. Often either the apostrophe or the double quote may be used, allowing string constants to contain the other character (but not to contain both without using an escape character).
The C programming language (and many related languages like C++ or Java) uses apostrophes to delimit a character constant. In C it is seen as a character value only, different from a 1-letter string.
In Visual Basic an apostrophe is used to denote the start of a comment.
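Python is one language in which either the apostrophe or the double quote may delimit a string constant. The short sketch below shows the three cases described above: containing an apostrophe, containing a double quote, and containing both by means of an escape character. The variable names are illustrative only.

```python
# Double-quoted string containing an apostrophe.
with_apostrophe = "don't"

# Apostrophe-delimited string containing double quotes.
with_double_quotes = 'say "hi"'

# Containing both requires escaping whichever mark delimits the string.
with_both = 'don\'t say "hi"'

print(with_apostrophe, with_double_quotes, with_both)
```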
In computing, a newline,[1] also known as a line break or end-of-line (EOL) marker, is a special character or sequence of characters signifying the end of a line of text. The name comes from the fact that the next character after the newline will appear on a new line—that is, on the next line below the text immediately preceding the newline. The actual codes representing a newline vary across operating systems, which can be a problem when exchanging text files between systems with different newline representations.
There is also some confusion whether newlines terminate or separate lines. If a newline is considered a separator, there will be no newline after the last line of a file. The general convention on most systems is to add a newline even after the last line, i.e. to treat newline as a line terminator. Some programs have problems processing the last line of a file if it is not newline terminated. Conversely, programs that expect newline to be used as a separator will interpret a final newline as starting a new (empty) line.
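The difference between the two interpretations can be seen with Python's string methods: str.split treats the newline as a separator, while str.splitlines treats it as a terminator. The helper ensure_final_newline is a hypothetical name used here for illustration.

```python
text = "first\nsecond\n"

# Separator view: the final newline starts a new, empty line.
as_separator = text.split("\n")

# Terminator view: the final newline merely ends the last line.
as_terminator = text.splitlines()


def ensure_final_newline(s: str) -> str:
    """Append a newline to non-empty text that does not already end with one."""
    if s and not s.endswith("\n"):
        return s + "\n"
    return s


print(as_separator)
print(as_terminator)
print(repr(ensure_final_newline("no terminator")))
```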
In text intended primarily to be read by humans using software which implements the word wrap feature, a newline character typically only needs to be stored if a line break is required independent of whether the next word would fit on the same line, such as between paragraphs and in vertical lists. See hard return and soft return.
Software applications and operating systems usually represent a newline with one or two control characters:
Most textual Internet protocols (including HTTP, SMTP, FTP, IRC and many others) mandate the use of ASCII CR+LF (0x0D 0x0A) on the protocol level, but recommend that tolerant applications recognize lone LF as well. In practice, there are many applications that erroneously use the C newline character '\n' instead (see section Newline in programming languages below). This leads to problems when trying to communicate with systems adhering to a stricter interpretation of the standards; one such system is the qmail MTA that actively refuses to accept messages from systems that send bare LF instead of the required CR+LF.[2]
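A minimal sketch of both halves of this advice, in Python: a sender that writes CR+LF explicitly instead of relying on '\n', and a tolerant receiver that also accepts a lone LF. The function names and the host example.com are assumptions made for illustration and do not come from any protocol library.

```python
import re
from typing import List

CRLF = b"\r\n"


def build_request(host: str) -> bytes:
    """Build a small HTTP-style request with explicit CR+LF line endings."""
    lines = [b"GET / HTTP/1.1", b"Host: " + host.encode("ascii"), b""]
    # Each line, including the empty one that ends the header block, ends in CR+LF.
    return CRLF.join(lines) + CRLF


def split_lines_tolerantly(data: bytes) -> List[bytes]:
    """Split on CR+LF, but also accept a bare LF from lenient senders."""
    return re.split(rb"\r?\n", data)


print(build_request("example.com"))
print(split_lines_tolerantly(b"a\r\nb\nc"))
```

Writing the two bytes explicitly avoids depending on how a given language or platform translates '\n' on output.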
FTP has a feature to transform newlines between CR+LF and lone LF when transferring text files. This must not be used on binary files. Usually binary files and text files are recognised by checking their filename extension.
The Unicode standard defines a large number of characters that conforming applications should recognize as line terminators:[3]
LF: Line Feed, U+000A
VT: Vertical Tab, U+000B
FF: Form Feed, U+000C
CR: Carriage Return, U+000D
CR+LF: CR (U+000D) followed by LF (U+000A)
NEL: Next Line, U+0085
LS: Line Separator, U+2028
PS: Paragraph Separator, U+2029
This may seem overly complicated compared to an approach such as converting all line terminators to a single character, for example LF. However, Unicode was designed to preserve all information when converting a text file from any existing encoding to Unicode and back. Therefore, Unicode should contain characters included in existing encodings. NEL is included in ISO-8859-1[citation needed] and EBCDIC (0x15). The approach taken in the Unicode standard allows round-trip transformation to be information-preserving while still enabling applications to recognize all possible types of line terminators.
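Python's str.splitlines is one example of an implementation that recognises all of the terminators listed above. It also treats the information separators U+001C to U+001E as line boundaries, which go beyond that list. The sketch below checks each listed terminator in turn; the dictionary name is illustrative.

```python
LINE_TERMINATORS = {
    "LF": "\u000a",
    "VT": "\u000b",
    "FF": "\u000c",
    "CR": "\u000d",
    "CR+LF": "\u000d\u000a",
    "NEL": "\u0085",
    "LS": "\u2028",
    "PS": "\u2029",
}

for name, terminator in LINE_TERMINATORS.items():
    # Each terminator, including the two-character CR+LF, yields one line break.
    parts = ("before" + terminator + "after").splitlines()
    print(name, parts)
```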
Recognizing and using the newline codes greater than 0x7F is not often done. They are multiple bytes in UTF-8 and the code for NEL has been used as the ellipsis ('…') character in Windows-1252. For instance:
ASCII was developed simultaneously by the ISO and the ASA, the predecessor organization to ANSI. During the period of 1963–1968, the ISO draft standards supported the use of either CR+LF or LF alone as a newline, while the ASA drafts supported only CR+LF.
The sequence CR+LF was in common use on many early computer systems that had adopted Teletype machines, typically a Teletype Model 33 ASR, as a console device, because this sequence was required to position those printers at the start of a new line. On these systems, text was often routinely composed to be compatible with these printers, since the concept of device drivers hiding such hardware details from the application was not yet well developed; applications had to talk directly to the Teletype machine and follow its conventions.
Most minicomputer systems from DEC used this convention. CP/M used it as well, to print on the same terminals that minicomputers used. From there MS-DOS (1981) adopted CP/M's CR+LF in order to be compatible, and this convention was inherited by Microsoft's later Windows operating system.
The separation of the two functions concealed the fact that the print head could not return from the far right to the beginning of the next line in one-character time. That is why the sequence was always sent with the CR first. In fact, it was often necessary to send extra characters (extraneous CRs or NULs, which are ignored) to give the print head time to move to the left margin. Even many early video displays required multiple character times to scroll the display.
The Multics operating system began development in 1964 and used LF alone as its newline. Multics used a device driver to translate this character to whatever sequence a printer needed (including extra padding characters), and the single byte was much more convenient for programming. The seemingly more obvious choice of CR was not used, as a plain CR provided the useful function of overprinting one line with another, and thus it was useful to not translate it. Unix followed the Multics practice, and later systems followed Unix.
To facilitate the creation of portable programs, programming languages provide some abstractions to deal with the different types of newline sequences used in different environments.
The C programming language provides the escape sequences '\n' (newline) and '\r' (carriage return). However, these are not required to be equivalent to the ASCII LF and CR control characters. The C standard only guarantees two things:
On Unix platforms, where C originated, the native newline sequence is ASCII LF (0x0A), so '\n' was simply defined to be that value. With the internal and external representation being identical, the translation performed in text mode is a no-op, and text mode and binary mode behave the same. This has caused many programmers who developed their software on Unix systems simply to ignore the distinction completely, resulting in code that is not portable to different platforms.
The C library function fgets() is best avoided in binary mode because any file not written with the UNIX newline convention will be misread. Also, in text mode, any file not written with the system's native newline sequence (such as a file created on a UNIX system, then copied to a Windows system) will be misread as well.
Another common problem is the use of '\n' when communicating using an Internet protocol that mandates the use of ASCII CR+LF for ending lines. Writing '\n' to a text mode stream works correctly on Windows systems, but produces only LF on Unix, and something completely different on more exotic systems. Using "\r\n" in binary mode is slightly better.
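The same text-mode pitfall exists outside C. The following Python sketch is an analogy rather than a demonstration of the C behaviour itself: it writes a line through a text-mode file and inspects the raw bytes that were stored. The helper name `bytes_written` is hypothetical; the `newline` argument of `open()` is what controls the translation.

```python
import os
import tempfile


def bytes_written(text: str, newline) -> bytes:
    """Write `text` in text mode with the given `newline` setting and
    return the raw bytes that ended up in the file."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w", encoding="ascii", newline=newline) as f:
            f.write(text)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


# newline=None: "\n" is translated to the platform's native sequence.
native = bytes_written("Color: Red\n", None)
# newline="": no translation, so the explicit "\r\n" is stored unchanged.
explicit = bytes_written("Color: Red\r\n", "")
```

Only the second form yields CR+LF on every platform, which is why protocol code usually writes the terminator explicitly to an untranslated stream.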
Many languages, such as C++, Perl,[6] and Haskell provide the same interpretation of '\n' as C.
Java, PHP,[7] and Python[8] provide the '\r\n' sequence (for ASCII CR+LF). In contrast to C, the escapes '\r' and '\n' in these languages are guaranteed to represent the values U+000D and U+000A, respectively.
The Java I/O libraries do not transparently translate these into platform-dependent newline sequences on input or output. Instead, they provide functions for writing a full line that automatically add the native newline sequence, and functions for reading lines that accept any of CR, LF, or CR+LF as a line terminator (see BufferedReader.readLine()). The System.getProperty() method can be used to retrieve the underlying line separator.
Example:
String eol = System.getProperty("line.separator");
String lineColor = "Color: Red" + eol;
Python permits "Universal Newline Support" when opening a file for reading, when importing modules, and when executing a file.[9]
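A minimal sketch of this behaviour in Python 3, where universal newline handling is selected through the `newline` argument of the text I/O layer. The sample data and the helper name `read_text` are made up for illustration, and an in-memory buffer stands in for a real file.

```python
import io

# Hypothetical sample containing LF, lone CR, and CR+LF terminators.
MIXED = b"unix\nold mac\rdos\r\nend"


def read_text(raw: bytes, newline) -> str:
    """Decode `raw` through a text wrapper using the given newline mode."""
    wrapper = io.TextIOWrapper(io.BytesIO(raw), encoding="ascii", newline=newline)
    return wrapper.read()


translated = read_text(MIXED, None)  # universal newlines: all become "\n"
untranslated = read_text(MIXED, "")  # terminators are passed through as-is
```

With `newline=None` the program sees only "\n" regardless of where the file came from; with `newline=""` lines are still recognized on any of the three terminators, but the original characters are preserved.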
Some languages have created special variables, constants, and subroutines to facilitate newlines during program execution.
The different newline conventions often cause text files that have been transferred between systems of different types to be displayed incorrectly. For example, files originating on Unix or Apple Macintosh systems may appear as a single long line in some Windows programs. Conversely, when viewing a file originating from a Windows computer on a Unix system, the extra CR may be displayed as ^M at the end of each line or as a second line break.
The problem can be hard to spot if some programs handle the foreign newlines properly while others do not. For example, a compiler may fail with obscure syntax errors even though the source file looks correct when displayed on the console or in an editor. On a Unix system, the command cat -v myfile.txt will send the file to stdout (normally the terminal) and make the ^M visible, which can be useful for debugging. Modern text editors generally recognize all flavours of CR / LF newlines and allow the user to convert between the different standards. Web browsers are usually also capable of displaying text files and websites which use different types of newlines.
The File Transfer Protocol can automatically convert newlines in files being transferred between systems with different newline representations when the transfer is done in "ASCII mode". However, transferring binary files in this mode usually has disastrous results: Any occurrence of the newline byte sequence—which does not have line terminator semantics in this context, but is just part of a normal sequence of bytes—will be translated to whatever newline representation the other system uses, effectively corrupting the file. FTP clients often employ some heuristics (for example, inspection of filename extensions) to automatically select either binary or ASCII mode, but in the end it is up to the user to make sure his or her files are transferred in the correct mode. If there is any doubt as to the correct mode, binary mode should be used, as then no files will be altered by FTP, though they may display incorrectly.
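The corruption described above can be reproduced without any network transfer. The sketch below does not model FTP itself; it only applies the two byte substitutions that a newline-translating transfer would make, using hypothetical helper names and a made-up binary payload.

```python
def to_lf(data: bytes) -> bytes:
    """Rewrite CR+LF as LF, as a translating transfer to a Unix system might."""
    return data.replace(b"\r\n", b"\n")


def to_crlf(data: bytes) -> bytes:
    """Rewrite every newline as CR+LF, as a transfer to Windows might."""
    return to_lf(data).replace(b"\n", b"\r\n")


# Made-up binary payload that happens to contain 0x0D 0x0A and a bare 0x0A.
payload = bytes([0x89, 0x0D, 0x0A, 0x00, 0x0A, 0xFF])

# Translating in one direction and then back does not restore the payload,
# because the distinction between the two original byte patterns is lost.
round_trip = to_lf(to_crlf(payload))
```

Here the six-byte payload comes back as five bytes, so even a "reverse" transfer in the same mode cannot be relied on to undo the damage.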
Text editors are often used for converting a text file between different newline formats; most modern editors can read and write files using at least the different ASCII CR/LF conventions. The standard Windows editor Notepad is not one of them (although WordPad and the MS-DOS Editor are).
Editors are often unsuitable for converting larger files. For larger files (on Windows NT/2000/XP) the following command is often used:
TYPE unix_file | FIND "" /V > dos_file
On many Unix systems, the dos2unix (sometimes named fromdos or d2u) and unix2dos (sometimes named todos or u2d) utilities are used to translate between ASCII CR+LF (DOS/Windows) and LF (Unix) newlines. Different versions of these commands vary slightly in their syntax. However, the tr command is available on virtually every Unix-like system and is used to perform arbitrary replacement operations on single characters. A DOS/Windows text file can be converted to Unix format by simply removing all ASCII CR characters with
tr -d '\r' < inputfile > outputfile
or, if the text has only CR newlines, by converting all CR newlines to LF with
tr '\r' '\n' < inputfile > outputfile
The same tasks are sometimes performed with awk, sed, tr, or in Perl if the platform has a Perl interpreter:
awk '{sub("$","\r\n"); printf("%s",$0);}' inputfile > outputfile  # UNIX to DOS (adding CRs on Linux and BSD based OS that haven't GNU extensions)
awk '{gsub("\r",""); print;}' inputfile > outputfile              # DOS to UNIX (removing CRs on Linux and BSD based OS that haven't GNU extensions)
sed -e 's/$/\r/' inputfile > outputfile                           # UNIX to DOS (adding CRs on Linux based OS that use GNU extensions)
sed -e 's/\r$//' inputfile > outputfile                           # DOS to UNIX (removing CRs on Linux based OS that use GNU extensions)
cat inputfile | tr -d "\r" > outputfile                           # DOS to UNIX (removing CRs using tr(1). Not Unicode compliant.)
perl -pe 's/\r?\n|\r/\r\n/g' inputfile > outputfile               # Convert to DOS
perl -pe 's/\r?\n|\r/\n/g' inputfile > outputfile                 # Convert to UNIX
perl -pe 's/\r?\n|\r/\r/g' inputfile > outputfile                 # Convert to old Mac
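Where a Python interpreter is available, the same substitution can be written as a short function. This is a sketch that reuses the alternation from the Perl examples above; the names `ANY_NEWLINE` and `convert_newlines` are hypothetical, and the data is assumed to fit in memory.

```python
import re

# Same alternation as the Perl examples: CR+LF or LF first, then a lone CR.
ANY_NEWLINE = re.compile(rb"\r?\n|\r")


def convert_newlines(data: bytes, target: bytes) -> bytes:
    """Replace every CR+LF, LF, or lone CR in `data` with `target`."""
    # A function is used as the replacement so that `target` is inserted
    # literally, without any escape processing by the regex engine.
    return ANY_NEWLINE.sub(lambda _match: target, data)
```

Passing `b"\r\n"`, `b"\n"`, or `b"\r"` as `target` gives the DOS, UNIX, and old Mac conversions respectively. Working on bytes rather than text avoids any additional newline translation by the I/O layer.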
To identify what type of line breaks a text file contains, the file command can be used. The editor Vim can then be used to convert a file so that it is compatible with the Windows Notepad text editor. For example:
[prompt] > file myfile.txt
myfile.txt: ASCII English text
[prompt] > vim myfile.txt

within vim

:set fileformat=dos
:wq

[prompt] > file myfile.txt
myfile.txt: ASCII English text, with CRLF line terminators
The following grep commands echo the filename (in this case myfile.txt) to the command line if the file is of the specified style:
grep -PL $'\r\n' myfile.txt # show UNIX style file (LF terminated)
grep -Pl $'\r\n' myfile.txt # show DOS style file (CRLF terminated)
For Debian-based systems, these commands are used:
egrep -L $'\r\n' myfile.txt # show UNIX style file (LF terminated)
egrep -l $'\r\n' myfile.txt # show DOS style file (CRLF terminated)
The above grep commands work under Unix systems or in Cygwin under Windows. Note that these commands make some assumptions about the kinds of files that exist on the system (specifically it's assuming only UNIX and DOS-style files—no Mac OS 9-style files).
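When old Mac-style files may also be present, the three conventions can be told apart by counting byte patterns. The following Python sketch uses the hypothetical name `count_newline_styles` and assumes the file contents have been read in binary mode.

```python
def count_newline_styles(data: bytes) -> dict:
    """Count CR+LF pairs, lone LFs, and lone CRs in `data`."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        # Every CR+LF pair contributes one LF and one CR to the raw totals,
        # so subtracting the pairs leaves only the unpaired characters.
        "LF": data.count(b"\n") - crlf,
        "CR": data.count(b"\r") - crlf,
    }
```

A file with non-zero counts in more than one category has mixed line endings, a case the grep commands above do not distinguish.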
This technique is often combined with find to list files recursively. For instance, the following command checks all "regular files" (i.e. it will exclude directories, symbolic links, etc.) to find all UNIX-style files in a directory tree, starting from the current directory (.), and saves the results in file unix_files.txt, overwriting it if the file already exists:
find . -type f -exec grep -PL '\r\n' {} \; > unix_files.txt
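A rough Python counterpart of the recursive search is sketched below. The function name `find_lf_only_files` is hypothetical, each file is read fully into memory, and, like the command above, a file is reported whenever it contains no CR+LF at all (including files with no line breaks).

```python
import os


def find_lf_only_files(root: str) -> list:
    """Return regular files under `root` whose contents contain no CR+LF."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symbolic links and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            with open(path, "rb") as f:
                if b"\r\n" not in f.read():
                    matches.append(path)
    return sorted(matches)
```

Calling `find_lf_only_files(".")` walks the tree from the current directory; the returned list can then be written to a file or passed to a conversion step.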
This example will find C files and convert them to LF style line endings:
find . -name '*.[ch]' -exec fromdos {} \;
The file command also detects the type of EOL used:
file myfile.txt
myfile.txt: ASCII text, with CRLF line terminators
Other tools permit the user to visualise the EOL characters:
od -a myfile.txt
cat -e myfile.txt
hexdump -c myfile.txt
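Where these tools are unavailable, a comparable view can be produced with a few lines of Python. The sketch below loosely imitates the output style of `cat -e` (CR shown as ^M, each LF marked with $); the function name `show_eol` is hypothetical and non-ASCII bytes are simply shown as replacement characters.

```python
def show_eol(data: bytes) -> str:
    """Render CR as ^M and mark each LF with $, keeping the line breaks."""
    text = data.decode("ascii", errors="replace")
    return text.replace("\r", "^M").replace("\n", "$\n")
```

For a DOS-style file each displayed line ends in `^M$`, while a UNIX-style file shows only `$`.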
dos2unix, unix2dos, mac2unix, unix2mac, mac2dos, dos2mac can perform conversions. The flip[10] command is often used.