By Damien Raude-Morvan and David Morin
Presented at the Nantes Java User Group - December 15, 2014
I'll just import my log files directly into HDFS!
2014-12-13 14:17:48 RelevanceFilter [INFO] Verbatim result: relevant scored, emitted
2014-12-13 14:19:34 RelevanceFilter [INFO] Verbatim result: relevant scored, emitted
2014-12-13 14:27:30 RelevanceFilter [INFO] Verbatim result: relevant scored, emitted
2014-12-13 14:28:07 RelevanceFilter [INFO] Verbatim result: relevant scored, emitted
2014-12-13T14:17:48 d.a.r.RelevanceFilter [INFO] Verbatim result: relevant scored, emitted, (key=verbatim://mission/54631a440cf2118b4f346d6e/1e482ca4d4f3aa00e0745e409903b142, relevance=1.0)
2014-12-13T14:19:34 d.a.r.RelevanceFilter [INFO] Verbatim result: relevant scored, emitted, (key=verbatim://mission/54631a440cf2118b4f346d6e/1e482caa1f2ba500e0743d4617c85ffa, relevance=1.0)
2014-12-13T14:27:30 d.a.r.RelevanceFilter [INFO] Verbatim result: relevant scored, emitted, (key=verbatim://mission/546369e20cf2118b4f3d8266/1e482cbbc795ae00e0745c0a84ceaea8, relevance=1.0)
2014-12-13T14:28:07 d.a.r.RelevanceFilter [INFO] Verbatim result: relevant scored, emitted, (key=verbatim://mission/54631a440cf2118b4f346d6e/1e482cbd9159a600e074660bbb69ef8e, relevance=1.0)
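Before reaching for serialization frameworks, it is worth seeing what consuming such free-form lines costs. A hedged sketch (pattern and field names are illustrative, not from the original talk): every value comes back as an untyped string, and the second format above, with its `T` date separator and dotted logger name, already breaks a pattern written for the first. Structured, schema-based formats such as Thrift avoid exactly this.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BrittleLogParser {
    // Hand-tuned for the first log format above; it silently stops
    // matching once the timestamp separator and logger name change.
    private static final Pattern LINE = Pattern.compile(
        "(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}) (\\S+) \\[(\\w+)\\] (.*)");

    public static void main(String[] args) {
        String line = "2014-12-13 14:17:48 RelevanceFilter [INFO] "
                + "Verbatim result: relevant scored, emitted";
        Matcher m = LINE.matcher(line);
        if (m.matches()) {
            // Everything is a string: no types, no schema, no evolution story.
            System.out.println("logger=" + m.group(2) + " level=" + m.group(3));
        }
    }
}
```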
namespace java com.example.project

enum TweetType {
  TWEET,
  RETWEET = 2,
  DM = 0xa,
  REPLY
}

struct Tweet {
  1: i32 userId,
  2: string userName,
  3: string text,
  4: TweetType tweetType = TweetType.TWEET
}
namespace java com.example.project

enum TweetType {
  TWEET,
  RETWEET = 2,
  DM = 0xa,
  REPLY
}

// Assumed definition, not shown on the original slide: the new field's
// type must be declared in the IDL for it to compile.
struct Location {
  1: double latitude,
  2: double longitude
}

struct Tweet {
  1: i32 userId,
  2: string userName,
  3: string text,
  4: TweetType tweetType = TweetType.TWEET,
  // Add Location! New fields get a fresh id, so readers built
  // against the old schema keep working.
  5: Location location
}
Two levels:
| key-value | column-oriented | document-oriented | other |
|---|---|---|---|
| *(store logo)* | *(store logos)* | *(store logos)* | JDBC |
{
  "type": "record",
  "name": "Pageview",
  "namespace": "org.apache.gora.tutorial.log.generated",
  "fields": [
    {"name": "url", "type": "string"},
    {"name": "timestamp", "type": "long"},
    {"name": "ip", "type": "string"},
    {"name": "httpMethod", "type": "string"},
    {"name": "httpStatusCode", "type": "int"},
    {"name": "responseSize", "type": "int"},
    {"name": "referrer", "type": "string"},
    {"name": "userAgent", "type": "string"}
  ]
}
bin/gora goracompiler gora-tutorial/src/main/avro/pageview.json target/generated-sources/
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
gora.datastore.autocreateschema=true
gora.datastore.scanner.caching=1000
hbase.client.autoflush.default=false
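With `gora.datastore.default` set, client code can obtain the store without naming the backend. A minimal sketch using the generated `Pageview` bean; note that the exact setter types (`CharSequence` vs. Avro `Utf8`) depend on the Gora/Avro version:

```java
import org.apache.gora.store.DataStore;
import org.apache.gora.store.DataStoreFactory;
import org.apache.gora.tutorial.log.generated.Pageview;
import org.apache.hadoop.conf.Configuration;

public class PageviewWriter {
    public static void main(String[] args) throws Exception {
        // Resolves gora.datastore.default from gora.properties (HBaseStore here).
        DataStore<Long, Pageview> store =
            DataStoreFactory.getDataStore(Long.class, Pageview.class, new Configuration());

        Pageview view = new Pageview();
        view.setUrl("/index.html");                  // setter types vary by version
        view.setTimestamp(System.currentTimeMillis());
        view.setHttpStatusCode(200);

        store.put(1L, view);   // key -> row in the backing store
        store.flush();         // needed with hbase.client.autoflush.default=false
        store.close();
    }
}
```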
// Source store: Pageview rows read by the mapper.
inStore = DataStoreFactory.getDataStore(
    dataStoreClass, Long.class, Pageview.class, conf);
// Sink store: MetricDatum beans written by the reducer.
outStore = DataStoreFactory.getDataStore(
    dataStoreClass, String.class, MetricDatum.class, conf);
Job job = new Job(getConf());
job.setJobName("Log Analytics");
job.setNumReduceTasks(numReducer);
job.setJarByClass(getClass());

// Wire the input DataStore as the job's source: Gora provides the InputFormat.
GoraMapper.initMapperJob(job, inStore,
    TextLong.class, LongWritable.class, XXXXMapper.class, true);
// Wire the output DataStore as the job's sink.
GoraReducer.initReducerJob(job, outStore, XXXXReducer.class);
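For orientation, the mapper plugged in above receives `(Long, Pageview)` pairs straight from the input store. A hedged sketch modeled on the Gora log-analytics tutorial; `TextLong` and its `getKey()`/`getValue()` accessors are assumed from the gora-tutorial sources and may differ between versions:

```java
import java.io.IOException;

import org.apache.gora.mapreduce.GoraMapper;
import org.apache.gora.tutorial.log.TextLong;               // assumed tutorial class
import org.apache.gora.tutorial.log.generated.Pageview;
import org.apache.hadoop.io.LongWritable;

public class XXXXMapper extends GoraMapper<Long, Pageview, TextLong, LongWritable> {

    private final LongWritable one = new LongWritable(1L);
    private final TextLong tuple = new TextLong();

    @Override
    protected void map(Long key, Pageview pageview, Context context)
            throws IOException, InterruptedException {
        // Emit (url, timestamp) -> 1; the reducer sums the counts per key.
        tuple.getKey().set(pageview.getUrl().toString());    // assumed accessor
        tuple.getValue().set(pageview.getTimestamp());       // assumed accessor
        context.write(tuple, one);
    }
}
```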
A concrete example we implemented in the past: the "legacy" scheduler hands the Pig script a file of environment variables, used to contextualize the MR job on the JobTracker (name of the originating job, environment, launch time, user, etc.).
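A hedged illustration of that mechanism, with hypothetical file and variable names; Pig's standard `-param_file` option injects the values, and `$`-substitution makes them visible to the script:

```
# env.params, written by the scheduler before each run
JOB_NAME=daily-analytics
LAUNCH_USER=batch

$ pig -param_file env.params analytics.pig
```

Inside the script, the values can for instance label the job on the JobTracker UI:

```
-- analytics.pig: surface the caller's context
SET job.name 'analytics launched by $LAUNCH_USER ($JOB_NAME)';
```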
-- Load a file from HDFS
-- e.g.: this is an example line for the wordcount
lines = LOAD '/user/XXX/wc.txt' AS (line:chararray);
-- Iterate on each line:
-- TOKENIZE splits the line into words, FLATTEN turns the bag into tuples
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group by word
grouped = GROUP words BY word;
-- Count the number of occurrences in each group (word)
wordcount = FOREACH grouped GENERATE group, COUNT(words);
-- Print the results to stdout
DUMP wordcount;
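The script can be tried without a cluster using Pig's local execution mode (script name hypothetical; the LOAD path must then point at the local filesystem):

```
$ pig -x local wordcount.pig    # or -x mapreduce to run on the cluster
```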
At Internet scale, more and more data is produced
Twitter / Facebook / Disqus
application / technical logs
clicks, purchases, browsing
sensors
RabbitMQ, Kafka, JMS
Stores the cluster state (distribution of work)
Starts and stops the individual workers