Sub-task
- [NUTCH-1124] - JUnit test for scoring-opic
- [NUTCH-1125] - JUnit test for tld
- [NUTCH-1164] - Write JUnit tests for protocol-http
- [NUTCH-1170] - Write JUnit tests for urlfilter-validator
- [NUTCH-1655] - Indexer Plugin for Elastic Search
- [NUTCH-1878] - urlnormalizer-regex to keep third slash in file:///path/index.html
- [NUTCH-1879] - Regex URL normalizer should remove multiple slashes after file: protocol
- [NUTCH-1880] - URLUtil should not add additional slashes for file URLs
- [NUTCH-1885] - Protocol-file should treat symbolic links as redirects
Bug
- [NUTCH-356] - Plugin repository cache can lead to memory leak
- [NUTCH-385] - Improve description of thread related configuration for Fetcher
- [NUTCH-797] - URL not properly constructed when link target begins with a "?"
- [NUTCH-911] - recrawls file protocol causes Errors/Exceptions when actually not modified or gone
- [NUTCH-970] - Injector job crashes with MySQL with table collation set to utf8_general_ci
- [NUTCH-992] - SolrDedup is broken in 2.x
- [NUTCH-1182] - fetcher to log hung threads
- [NUTCH-1253] - Incompatible neko and xerces versions
- [NUTCH-1329] - parser does not extract outlinks to external web sites
- [NUTCH-1410] - impact of a map-reduce problem
- [NUTCH-1473] - Column length too big for column 'text' (max = 21845); use BLOB or TEXT instead
- [NUTCH-1481] - When using MySQL as storage unicode characters within URLS cause nutch to fail
- [NUTCH-1483] - Can't crawl filesystem with protocol-file plugin
- [NUTCH-1490] - Data Truncation exceptions when using mysql
- [NUTCH-1549] - Fix deprecated use of Tika MimeType API in o.a.n.util.MimeUtil
- [NUTCH-1562] - Order of execution for scoring filters
- [NUTCH-1566] - bin/nutch to allow whitespace in paths
- [NUTCH-1579] - NPE when using solr indexing
- [NUTCH-1587] - misspelled property "threshold" in conf/log4j.properties
- [NUTCH-1588] - Port NUTCH-1245 URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again to 2.x
- [NUTCH-1603] - ZIP parser complains about truncated PDF file
- [NUTCH-1604] - ProtocolFactory not thread-safe
- [NUTCH-1605] - mime type detector recognizes xlsx as zip file
- [NUTCH-1610] - Can't run individual unit tests for plugins in nutch 2.x
- [NUTCH-1613] - Timeouts in protocol-httpclient when crawling same host with >2 threads and added cookie strings for both http protocols
- [NUTCH-1618] - Turn speculative execution off for Fetching
- [NUTCH-1621] - Deprecated class o.a.n.crawl.Crawler is still in code base
- [NUTCH-1624] - Typo in WebTableReader line 486
- [NUTCH-1633] - slf4j is provided by hadoop and should not be included in the job file.
- [NUTCH-1634] - readdb -stats show the result twice
- [NUTCH-1650] - Adaptive Fetch Scheduler interval Wrong Set
- [NUTCH-1651] - modifiedTime and prevmodifiedTime never set
- [NUTCH-1657] - ORIGINAL_CHAR_ENCODING and CHAR_ENCODING_FOR_CONVERSION never set in HTMLParser
- [NUTCH-1667] - Updatedb always ignores batchId
- [NUTCH-1671] - indexchecker to add digest field
- [NUTCH-1672] - Inlinks are added twice in DbUpdateReducer
- [NUTCH-1673] - Title isn't reset in MoreIndexingFilter
- [NUTCH-1677] - ORIGINAL_CHAR_ENCODING and CHAR_ENCODING_FOR_CONVERSION are not set in Parse HTML
- [NUTCH-1685] - URLUtil.toUNICODE fails on IDNs
- [NUTCH-1699] - Tika Parser - Image Parse Bug
- [NUTCH-1708] - use same id when indexing and deleting redirects
- [NUTCH-1715] - RobotRulesParser adds additional '*' to the robots name
- [NUTCH-1716] - RobotRulesParser adds extra '*' to the robots name
- [NUTCH-1718] - redefine http.robots.agent as "additional agent names"
- [NUTCH-1719] - DomainStatistics fails in 2.x because URL is not unreversed
- [NUTCH-1720] - Duplicate lines in HttpBase.java
- [NUTCH-1725] - CleaningJob's reducer does not commit deleted docs.
- [NUTCH-1727] - Configurable length for Tlds
- [NUTCH-1728] - indexer-solr plugin does not delete docs from solr
- [NUTCH-1733] - parse-html to support HTML5 charset definitions
- [NUTCH-1736] - Can't fetch page if http response header contains Transfer-Encoding:chunked
- [NUTCH-1738] - Expose number of URLs generated per batch in GeneratorJob
- [NUTCH-1751] - Empty anchors should not index
- [NUTCH-1752] - cache robots.txt rules per protocol:host:port
- [NUTCH-1753] - Eclipse dependency problem for 2.x
- [NUTCH-1755] - Project name bug in build.xml
- [NUTCH-1759] - Upgrade to Crawler Commons 0.4
- [NUTCH-1761] - Crawl script fails to find job file if not started from inside bin dir
- [NUTCH-1767] - remove special treatment of "params" in relative links
- [NUTCH-1773] - Solr Indexer fails
- [NUTCH-1774] - Crawling from REST API giving NullPointerException
- [NUTCH-1776] - Log incorrect plugin.folder file path
- [NUTCH-1778] - Generator not logging number of URLs in batch correctly
- [NUTCH-1780] - ttl and gc_grace_seconds attributes are missing from gora-cassandra-mapping.xml file
- [NUTCH-1784] - modifiedTime and prevmodifiedTime never set
- [NUTCH-1788] - Tika may return multiple values for Title on PDFs
- [NUTCH-1796] - Ensure Gora object builders are used as opposed to empty constructors.
- [NUTCH-1798] - Crawl script not calling index command correctly
- [NUTCH-1811] - bin/nutch junit to use junit 4 test runner
- [NUTCH-1819] - Check for batchId input in GeneratorJob#run
- [NUTCH-1820] - remove field "orig" which duplicates "id"
- [NUTCH-1825] - protocol-http may hang for certain web pages
- [NUTCH-1828] - bin/crawl : incorrect handling of nutch errors
- [NUTCH-1829] - Generator : unable to distinguish real errors
- [NUTCH-1832] - Make Nutch work without an indexer
- [NUTCH-1834] - GeneratorMapper behavior depends on log level
- [NUTCH-1845] - Nutch cannot save inlinks
- [NUTCH-1848] - Bug in DashboardPage.html instances counter
- [NUTCH-1865] - Enable use of SNAPSHOT's with Nutch Ivy dependency management
- [NUTCH-1866] - ant eclipse target should not delete runtime
- [NUTCH-1877] - Suffix URL filter to ignore query string by default
- [NUTCH-1882] - ant eclipse target to add output path to src/test
- [NUTCH-1891] - Can't run nutch2.3-snapshot on hadoop2.4.0 using gora0.5 and mongodb as backend datastore
- [NUTCH-1899] - upgrade restlet lib to prevent build failure
- [NUTCH-1903] - Resolve-default failed with branch 2.x
- [NUTCH-1907] - Incorrect output of Outlinks to Hosts within HostDbUpdateReducer
New Feature
- [NUTCH-929] - Create a REST-based admin UI for Nutch
- [NUTCH-1360] - Support the storing of IP address connected to when web crawling
- [NUTCH-1590] - [SECURITY] Frame injection vulnerability in published Javadoc
- [NUTCH-1693] - TextMD5Signature computed on textual content
- [NUTCH-1856] - Document webpage.avsc and host.avsc
Improvement
- [NUTCH-841] - Create a Wicket-based Web Application for Nutch
- [NUTCH-945] - Indexing to multiple SOLR Servers
- [NUTCH-1294] - IndexClean job with solr implementation.
- [NUTCH-1409] - Remove deprecated properties db.{default,max}.fetch.interval, generate.max.per.host.by.ip
- [NUTCH-1413] - Record response time
- [NUTCH-1478] - Parse-metatags and index-metadata plugin for Nutch 2.x series
- [NUTCH-1497] - Better default gora-sql-mapping.xml with larger field sizes for MySQL
- [NUTCH-1513] - Support Robots.txt for Ftp urls
- [NUTCH-1556] - enabling updatedb to accept batchId
- [NUTCH-1568] - port pluggable indexing architecture to 2.x
- [NUTCH-1595] - Upgrade to Tika 1.4
- [NUTCH-1599] - Obtain consensus on new description of Nutch
- [NUTCH-1619] - Writes Dmoz Description and Title information to db with snippet argument
- [NUTCH-1629] - there is no need to fail on empty lines in seed file when injecting.
- [NUTCH-1631] - Display Document Count Added To Solr Server
- [NUTCH-1632] - add batchId argument for DbUpdaterJob
- [NUTCH-1641] - Log timings for main jobs
- [NUTCH-1674] - Use batchId filter to enable scan (GORA-119) for Fetch,Parse,Update,Index
- [NUTCH-1710] - Add gora package logging to log4j.properties
- [NUTCH-1714] - Nutch 2.x upgrade to Gora 0.4
- [NUTCH-1721] - Upgrade to Crawler commons 0.3
- [NUTCH-1731] - Better cmd line parsing for NutchServer
- [NUTCH-1743] - parsechecker to show outlinks
- [NUTCH-1768] - Upgrade to ElasticSearch 1.1.0
- [NUTCH-1769] - REST API refactoring
- [NUTCH-1781] - Update gora-*-mapping.xml and gora.properties to reflect Gora 0.4
- [NUTCH-1782] - NodeWalker to return current node
- [NUTCH-1787] - update and complete API doc overview page
- [NUTCH-1797] - remove unused package o.a.n.html
- [NUTCH-1823] - Upgrade to elasticsearch 1.4.1
- [NUTCH-1827] - Port NUTCH-1467 and NUTCH-1561 to 2.x
- [NUTCH-1843] - Upgrade to Gora 0.5
- [NUTCH-1851] - Add/Update wiki pages for NutchServer and WebApp
- [NUTCH-1876] - Upgrade to Crawler Commons 0.5
- [NUTCH-1883] - bin/crawl: use function to run bin/nutch and check exit value
- [NUTCH-1888] - Specify HTMLMapper to use in TikaParser
Test
- [NUTCH-1645] - Junit Test Case for Adaptive Fetch Schedule class
Task
- [NUTCH-1696] - Enable use of (Gora) SNAPSHOT dependencies
- [NUTCH-1700] - Remove deprecated code in src/plugin/creativecommons/build.xml
- [NUTCH-1779] - Apply formatting to the code
- [NUTCH-1789] - Migrate Nutch site to Apache CMS
- [NUTCH-1817] - Remove pom.xml from source
- [NUTCH-1837] - Upgrade to Tika 1.6
- [NUTCH-1859] - Make Nutch webapp port configurable