learnxinyminutes-docs/kdb+.md
2024-12-09 04:34:00 -07:00

22 KiB

name contributors filename
kdb+
Matt Doherty
https://github.com/picodoc
Jonny Press
https://github.com/jonnypress
learnkdb.q

The q language and its database component kdb+ were developed by Arthur Whitney and released by Kx systems in 2003. q is a descendant of APL and as such is very terse and a little strange looking for anyone from a "C heritage" language background. Its expressiveness and vector oriented nature make it well suited to performing complex calculations on large amounts of data (while also encouraging some amount of code golf). The fundamental structure in the language is not the object but instead the list, and tables are built as collections of lists. This means - unlike most traditional RDBMS systems - tables are column oriented. The language has both an in-memory and on-disk database built in, giving a large amount of flexibility. kdb+ is most widely used in the world of finance to store, analyze, process and retrieve large time-series data sets.

The terms q and kdb+ are usually used interchangeably, as the two are not separable so this distinction is not really useful.

To learn more about kdb+ you can join the KX Community forums or the TorQ kdb+ group.

/ Single line comments start with a forward-slash
/ These can also be used in-line, so long as at least one whitespace character
/ separates it from text to the left
/
  A forward-slash on a line by itself starts a multiline comment
  and a backward-slash on a line by itself terminates it
\

/ Run this file in an empty directory


////////////////////////////////////
// Basic Operators and Datatypes  //
////////////////////////////////////

/ We have integers, which are 8 byte by default
3 / => 3

/ And floats, also 8 byte as standard. Trailing f distinguishes from int
3.0 / => 3f

/ 4 byte numerical types can also be specified with trailing chars
3i / => 3i
3.0e / => 3e

/ Math is mostly what you would expect
1+1 / => 2
8-1 / => 7
10*2 / => 20
/ Except division, which uses percent (%) instead of forward-slash (/)
35%5 / => 7f  (the result of division is always a float)

/ For integer division we have the keyword div
4 div 3 / => 1

/ Modulo also uses a keyword, since percent (%) is taken
4 mod 3 / => 1

/ And exponentiation...
2 xexp 4 / => 16

/ ...and truncating...
floor 3.14159 / => 3

/ ...getting the absolute value...
abs -3.14159 / => 3.14159
/ ...and many other things
/ see http://code.kx.com/q/ref/ for more

/ q has no operator precedence, everything is evaluated right to left
/ so results like this might take some getting used to
2*1+1 / => 4 / (no operator precedence tables to remember!)

/ Precedence can be modified with parentheses (restoring the 'normal' result)
(2*1)+1 / => 3

/ Assignment uses colon (:) instead of equals (=)
/ No need to declare variables before assignment
a:3
a / => 3

/ Variables can also be assigned in-line
/ this does not affect the value passed on
c:3+b:2+a:1 / (data "flows" from right to left)
a / => 1
b / => 3
c / => 6

/ In-place operations are also as you might expect
a+:2
a / => 3

/ There are no "true" or "false" keywords in q
/ boolean values are indicated by the bit value followed by b
1b / => true value
0b / => false value

/ Equality comparisons use equals (=) (since we don't need it for assignment)
1=1 / => 1b
2=1 / => 0b

/ Inequality uses <>
1<>1 / => 0b
2<>1 / => 1b

/ The other comparisons are as you might expect
1<2 / => 1b
1>2 / => 0b
2<=2 / => 1b
2>=2 / => 1b

/ Comparison is not strict with regard to types...
42=42.0 / => 1b

/ ...unless we use the match operator (~)
/ which only returns true if entities are identical
42~42.0 / => 0b

/ The not operator returns true if the underlying value is zero
not 0b / => 1b
not 1b / => 0b
not 42 / => 0b
not 0.0 / => 1b

/ The max operator (|) reduces to logical "or" for bools
42|2.0 / => 42f
1b|0b / => 1b

/ The min operator (&) reduces to logical "and" for bools
42&2.0 / => 2f
1b&0b / => 0b

/ q provides two ways to store character data
/ Chars in q are stored in a single byte and use double-quotes (")
ch:"a"
/ Strings are simply lists of char (more on lists later)
str:"This is a string"
/ Escape characters work as normal
str:"This is a string with \"quotes\""

/ Char data can also be stored as symbols using backtick (`)
symbol:`sym
/ Symbols are NOT LISTS, they are an enumeration
/ the q process stores internally a vector of strings
/ symbols are enumerated against this vector
/ this can be more space and speed efficient as these are constant width

/ The string function converts to strings
string `symbol / => "symbol"
string 1.2345 / => "1.2345"

/ q has a time type...
t:01:00:00.000
/ date type...
d:2015.12.25
/ and a datetime type (among other time types)
dt:2015.12.25D12:00:00.000000000

/ These support some arithmetic for easy manipulation
dt + t / => 2015.12.25D13:00:00.000000000
t - 00:10:00.000 / => 00:50:00.000
/ and can be decomposed using dot notation
d.year / => 2015i
d.mm / => 12i
d.dd / => 25i
/ see http://code.kx.com/q4m3/2_Basic_Data_Types_Atoms/#25-temporal-data for more

/ q also has an infinity value so div by zero will not throw an error
1%0 / => 0w
-1%0 / => -0w

/ And null types for representing missing values
0N / => null int
0n / => null float
/ see http://code.kx.com/q4m3/2_Basic_Data_Types_Atoms/#27-nulls for more

/ q has standard control structures
/ if is as you might expect (; separates the condition and instructions)
if[1=1;a:"hi"]
a / => "hi"
/ if-else uses $ (and unlike if, returns a value)
$[1=0;a:"hi";a:"bye"] / => "bye"
a / => "bye"
/ if-else can be extended to multiple clauses by adding args separated by ;
$[1=0;a:"hi";0=1;a:"bye";a:"hello again"]
a / => "hello again"


////////////////////////////////////
////      Data Structures       ////
////////////////////////////////////

/ q is not an object oriented language
/ instead complexity is built through ordered lists
/ and mapping them into higher order structures: dictionaries and tables

/ Lists (or arrays if you prefer) are simple ordered collections
/ they are defined using parentheses () and semi-colons (;)
(1;2;3) / => 1 2 3
(-10.0;3.14159e;1b;`abc;"c")
/ => -10f
/ => 3.14159e
/ => 1b
/ => `abc
/ => "c"  (mixed type lists are displayed on multiple lines)
((1;2;3);(4;5;6);(7;8;9))
/ => 1 2 3
/ => 4 5 6
/ => 7 8 9

/ Lists of uniform type can also be defined more concisely
1 2 3 / => 1 2 3
`list`of`syms / => `list`of`syms
`list`of`syms ~ (`list;`of;`syms) / => 1b

/ List length
count (1;2;3) / => 3
count "I am a string" / => 13 (string are lists of char)

/ Empty lists are defined with parentheses
l:()
count l / => 0

/ Simple variables and single item lists are not equivalent
/ parentheses syntax cannot create a single item list (they indicate precedence)
(1)~1 / => 1b
/ single item lists can be created using enlist
singleton:enlist 1
/ or appending to an empty list
singleton:(),1
1~(),1 / => 0b

/ Speaking of appending, comma (,) is used for this, not plus (+)
1 2 3,4 5 6 / => 1 2 3 4 5 6
"hello ","there" / => "hello there"

/ Indexing uses square brackets []
l:1 2 3 4
l[0] / => 1
l[1] / => 2
/ indexing out of bounds returns a null value rather than an error
l[5] / => 0N
/ and indexed assignment
l[0]:5
l / => 5 2 3 4

/ Lists can also be used for indexing and indexed assignment
l[1 3] / => 2 4
l[1 3]: 1 3
l / => 5 1 3 3

/ Lists can be untyped/mixed type
l:(1;2;`hi)
/ but once they are uniformly typed, q will enforce this
l[2]:3
l / => 1 2 3
l[2]:`hi / throws a type error
/ this makes sense in the context of lists as table columns (more later)

/ For a nested list we can index at depth
l:((1;2;3);(4;5;6);(7;8;9))
l[1;1] / => 5

/ We can elide the indexes to return entire rows or columns
l[;1] / => 2 5 8
l[1;] / => 4 5 6

/ All the functions mentioned in the previous section work on lists natively
1+(1;2;3) / => 2 3 4 (single variable and list)
(1;2;3) - (3;2;1) / => -2 0 2 (list and list)

/ And there are many more that are designed specifically for lists
avg 1 2 3 / => 2f
sum 1 2 3 / => 6
sums 1 2 3 / => 1 3 6 (running sum)
last 1 2 3 / => 3
1 rotate 1 2 3 / => 2 3 1
/ etc.
/ Using and combining these functions to manipulate lists is where much of the
/ power and expressiveness of the language comes from

/ Take (#), drop (_) and find (?) are also useful working with lists
l:1 2 3 4 5 6 7 8 9
l:1+til 9 / til is a useful shortcut for generating ranges
/ take the first 5 elements
5#l / => 1 2 3 4 5
/ drop the first 5
5_l / => 6 7 8 9
/ take the last 5
-5#l / => 5 6 7 8 9
/ drop the last 5
-5_l / => 1 2 3 4
/ find the first occurrence of 4
l?4 / => 3
l[3] / => 4

/ Dictionaries in q are a generalization of lists
/ they map a list to another list (of equal length)
/ the bang (!) symbol is used for defining a dictionary
d:(`a;`b;`c)!(1;2;3)
/ or more simply with concise list syntax
d:`a`b`c!1 2 3
/ the keyword key returns the first list
key d / => `a`b`c
/ and value the second
value d / => 1 2 3

/ Indexing is identical to lists
/ with the first list as a key instead of the position
d[`a] / => 1
d[`b] / => 2

/ As is assignment
d[`c]:4
d
/ => a| 1
/ => b| 2
/ => c| 4

/ Arithmetic and comparison work natively, just like lists
e:(`a;`b;`c)!(2;3;4)
d+e
/ => a| 3
/ => b| 5
/ => c| 8
d-2
/ => a| -1
/ => b| 0
/ => c| 2
d > (1;1;1)
/ => a| 0
/ => b| 1
/ => c| 1

/ And the take, drop and find operators are remarkably similar too
`a`b#d
/ => a| 1
/ => b| 2
`a`b _ d
/ => c| 4
d?2
/ => `b

/ Tables in q are basically a subset of dictionaries
/ a table is a dictionary where all values must be lists of the same length
/ as such tables in q are column oriented (unlike most RDBMS)
/ the flip keyword is used to convert a dictionary to a table
/ i.e. flip the indices
flip `c1`c2`c3!(1 2 3;4 5 6;7 8 9)
/ => c1 c2 c3
/ => --------
/ => 1  4  7
/ => 2  5  8
/ => 3  6  9
/ we can also define tables using this syntax
t:([]c1:1 2 3;c2:4 5 6;c3:7 8 9)
t
/ => c1 c2 c3
/ => --------
/ => 1  4  7
/ => 2  5  8
/ => 3  6  9

/ Tables can be indexed and manipulated in a similar way to dicts and lists
t[`c1]
/ => 1 2 3
/ table rows are returned as dictionaries
t[1]
/ => c1| 2
/ => c2| 5
/ => c3| 8

/ meta returns table type information
meta t
/ => c | t f a
/ => --| -----
/ => c1| j
/ => c2| j
/ => c3| j
/ now we see why type is enforced in lists (to protect column types)
t[1;`c1]:3
t[1;`c1]:3.0 / throws a type error

/ Most traditional databases have primary key columns
/ in q we have keyed tables, where one table containing key columns
/ is mapped to another table using bang (!)
k:([]id:1 2 3)
k!t
/ => id| c1 c2 c3
/ => --| --------
/ => 1 | 1  4  7
/ => 2 | 3  5  8
/ => 3 | 3  6  9

/ We can also use this shortcut for defining keyed tables
kt:([id:1 2 3]c1:1 2 3;c2:4 5 6;c3:7 8 9)

/ Records can then be retrieved based on this key
kt[1]
/ => c1| 1
/ => c2| 4
/ => c3| 7
kt[`id!1]
/ => c1| 1
/ => c2| 4
/ => c3| 7


////////////////////////////////////
////////     Functions      ////////
////////////////////////////////////

/ In q the function is similar to a mathematical map, mapping inputs to outputs
/ curly braces {} are used for function definition
/ and square brackets [] for calling functions (just like list indexing)
/ a very minimal function
f:{x+x}
f[2] / => 4

/ Functions can be anonymous and called at point of definition
{x+x}[2] / => 4

/ By default the last expression is returned
/ colon (:) can be used to specify return
{x+x}[2] / => 4
{:x+x}[2] / => 4
/ semi-colon (;) separates expressions
{r:x+x;:r}[2] / => 4

/ Function arguments can be specified explicitly (separated by ;)
{[arg1;arg2] arg1+arg2}[1;2] / => 3
/ or if omitted will default to x, y and z
{x+y+z}[1;2;3] / => 6

/ Built in functions are no different, and can be called the same way (with [])
+[1;2] / => 3
<[1;2] / => 1b

/ Functions are first class in q, so can be returned, stored in lists etc.
{:{x+y}}[] / => {x+y}
(1;"hi";{x+y})
/ => 1
/ => "hi"
/ => {x+y}

/ There is no overloading and no keyword arguments for custom q functions
/ however using a dictionary as a single argument can overcome this
/ allows for optional arguments or differing functionality
d:`arg1`arg2`arg3!(1.0;2;"my function argument")
{x[`arg1]+x[`arg2]}[d] / => 3f

/ Functions in q see the global scope
a:1
{:a}[] / => 1

/ However local scope obscures this
a:1
{a:2;:a}[] / => 2
a / => 1

/ Functions cannot see nested scopes (only local and global)
{local:1;{:local}[]}[] / throws error as local is not defined in inner function

/ A function can have one or more of its arguments fixed (projection)
f:+[4]
f[4] / => 8
f[5] / => 9
f[6] / => 10


////////////////////////////////////
//////////     q-sql      //////////
////////////////////////////////////

/ q has its own syntax for manipulating tables, similar to standard SQL
/ This contains the usual suspects of select, insert, update etc.
/ and some new functionality not typically available
/ q-sql has two significant differences (other than syntax) to normal SQL:
/ - q tables have well defined record orders
/ - tables are stored as a collection of columns
/   (so vectorized column operations are fast)
/ a full description of q-sql is a little beyond the scope of this intro
/ so we will just cover enough of the basics to get you going

/ First define ourselves a table
t:([]name:`Arthur`Thomas`Polly;age:35 32 52;height:180 175 160;sex:`m`m`f)

/ equivalent of SELECT * FROM t
select from t / (must be lower case, and the wildcard is not necessary)
/ => name   age height sex
/ => ---------------------
/ => Arthur 35  180    m
/ => Thomas 32  175    m
/ => Polly  52  160    f

/ Select specific columns
select name,age from t
/ => name   age
/ => ----------
/ => Arthur 35
/ => Thomas 32
/ => Polly  52

/ And name them (equivalent of using AS in standard SQL)
select charactername:name, currentage:age from t
/ => charactername currentage
/ => ------------------------
/ => Arthur        35
/ => Thomas        32
/ => Polly         52

/ This SQL syntax is integrated with the q language
/ so q can be used seamlessly in SQL statements
select name, feet:floor height*0.032, inches:12*(height*0.032) mod 1 from t
/ => name   feet inches
/ => ------------------
/ => Arthur 5    9.12
/ => Thomas 5    7.2
/ => Polly  5    1.44

/ Including custom functions
select name, growth:{[h;a]h%a}[height;age] from t
/ => name   growth
/ => ---------------
/ => Arthur 5.142857
/ => Thomas 5.46875
/ => Polly  3.076923

/ The where clause can contain multiple statements separated by commas
select from t where age>33,height>175
/ => name   age height sex
/ => ---------------------
/ => Arthur 35  180    m

/ The where statements are executed sequentially (not the same as logical AND)
select from t where age<40,height=min height
/ => name   age height sex
/ => ---------------------
/ => Thomas 32  175    m
select from t where (age<40)&(height=min height)
/ => name age height sex
/ => -------------------

/ The by clause falls between select and from
/ and is equivalent to SQL's GROUP BY
select avg height by sex from t
/ => sex| height
/ => ---| ------
/ => f  | 160
/ => m  | 177.5

/ If no aggregation function is specified, last is assumed
select by sex from t
/ => sex| name   age height
/ => ---| -----------------
/ => f  | Polly  52  160
/ => m  | Thomas 32  175

/ Update has the same basic form as select
update sex:`male from t where sex=`m
/ => name   age height sex
/ => ----------------------
/ => Arthur 35  180    male
/ => Thomas 32  175    male
/ => Polly  52  160    f

/ As does delete
delete from t where sex=`m
/ => name  age height sex
/ => --------------------
/ => Polly 52  160    f

/ None of these sql operations are carried out in place
t
/ => name   age height sex
/ => ---------------------
/ => Arthur 35  180    m
/ => Thomas 32  175    m
/ => Polly  52  160    f

/ Insert however is in place, it takes a table name, and new data
`t insert (`John;25;178;`m) / => ,3
t
/ => name   age height sex
/ => ---------------------
/ => Arthur 35  180    m
/ => Thomas 32  175    m
/ => Polly  52  160    f
/ => John   25  178    m

/ Upsert is similar (but doesn't have to be in-place)
t upsert (`Chester;58;179;`m)
/ => name    age height sex
/ => ----------------------
/ => Arthur  35  180    m
/ => Thomas  32  175    m
/ => Polly   52  160    f
/ => John    25  178    m
/ => Chester 58  179    m

/ it will also upsert dicts or tables
t upsert `name`age`height`sex!(`Chester;58;179;`m)
t upsert (`Chester;58;179;`m)
/ => name    age height sex
/ => ----------------------
/ => Arthur  35  180    m
/ => Thomas  32  175    m
/ => Polly   52  160    f
/ => John    25  178    m
/ => Chester 58  179    m

/ And if our table is keyed
kt:`name xkey t
/ upsert will replace records where required
kt upsert ([]name:`Thomas`Chester;age:33 58;height:175 179;sex:`f`m)
/ => name   | age height sex
/ => -------| --------------
/ => Arthur | 35  180    m
/ => Thomas | 33  175    f
/ => Polly  | 52  160    f
/ => John   | 25  178    m
/ => Chester| 58  179    m

/ There is no ORDER BY clause in q-sql, instead use xasc/xdesc
`name xasc t
/ => name   age height sex
/ => ---------------------
/ => Arthur 35  180    m
/ => John   25  178    m
/ => Polly  52  160    f
/ => Thomas 32  175    m

/ Most of the standard SQL joins are present in q-sql, plus a few new friends
/ see http://code.kx.com/q4m3/9_Queries_q-sql/#99-joins
/ the two most important (commonly used) are lj and aj

/ lj is basically the same as SQL LEFT JOIN
/ where the join is carried out on the key columns of the left table
le:([sex:`m`f]lifeexpectancy:78 85)
t lj le
/ => name   age height sex lifeexpectancy
/ => ------------------------------------
/ => Arthur 35  180    m   78
/ => Thomas 32  175    m   78
/ => Polly  52  160    f   85
/ => John   25  178    m   78

/ aj is an asof join. This is not a standard SQL join, and can be very powerful
/ The canonical example of this is joining financial trades and quotes tables
trades:([]time:10:01:01 10:01:03 10:01:04;sym:`msft`ibm`ge;qty:100 200 150)
quotes:([]time:10:01:00 10:01:01 10:01:01 10:01:03;
          sym:`ibm`msft`msft`ibm; px:100 99 101 98)
aj[`time`sym;trades;quotes]
/ => time     sym  qty px
/ => ---------------------
/ => 10:01:01 msft 100 101
/ => 10:01:03 ibm  200 98
/ => 10:01:04 ge   150
/ for each row in the trade table, the last (prevailing) quote (px) for that sym
/ is joined on.
/ see http://code.kx.com/q4m3/9_Queries_q-sql/#998-as-of-joins

////////////////////////////////////
/////     Extra/Advanced      //////
////////////////////////////////////

////// Adverbs //////
/ You may have noticed the total lack of loops to this point
/ This is not a mistake!
/ q is a vector language so explicit loops (for, while etc.) are not encouraged
/ where possible functionality should be vectorized (i.e. operations on lists)
/ adverbs supplement this, modifying the behaviour of functions
/ and providing loop type functionality when required
/ (in q functions are sometimes referred to as verbs, hence adverbs)
/ the "each" adverb modifies a function to treat a list as individual variables
first each (1 2 3;4 5 6;7 8 9)
/ => 1 4 7

/ each-left (\:) and each-right (/:) modify a two-argument function
/ to treat one of the arguments and individual variables instead of a list
1 2 3 +\: 11 22 33
/ => 12 23 34
/ => 13 24 35
/ => 14 25 36
1 2 3 +/: 11 22 33
/ => 12 13 14
/ => 23 24 25
/ => 34 35 36

/ The true alternatives to loops in q are the adverbs scan (\) and over (/)
/ their behaviour differs based on the number of arguments the function they
/ are modifying receives. Here I'll summarise some of the most useful cases
/ a single argument function modified by scan given 2 args behaves like "do"
{x * 2}\[5;1] / => 1 2 4 8 16 32 (i.e. multiply by 2, 5 times)
{x * 2}/[5;1] / => 32 (using over only the final result is shown)

/ If the first argument is a function, we have the equivalent of "while"
{x * 2}\[{x<100};1] / => 1 2 4 8 16 32 64 128 (iterates until returns 0b)
{x * 2}/[{x<100};1] / => 128 (again returns only the final result)

/ If the function takes two arguments, and we pass a list, we have "for"
/ where the result of the previous execution is passed back into the next loop
/ along with the next member of the list
{x + y}\[1 2 3 4 5] / => 1 3 6 10 15 (i.e. the running sum)
{x + y}/[1 2 3 4 5] / => 15 (only the final result)

/ There are other iterators and uses, this is only intended as quick overview
/ http://code.kx.com/q4m3/6_Functions/#67-iterators

////// Scripts //////
/ q scripts can be loaded from a q session using the "\l" command
/ for example "\l learnkdb.q" will load this script
/ or from the command prompt passing the script as an argument
/ for example "q learnkdb.q"

////// On-disk data //////
/ Tables can be persisted to disk in several formats
/ the two most fundamental are serialized and splayed
t:([]a:1 2 3;b:1 2 3f)
`:serialized set t / saves the table as a single serialized file
`:splayed/ set t / saves the table splayed into a directory

/ the dir structure will now look something like:
/ db/
/ ├── serialized
/ └── splayed
/     ├── a
/     └── b

/ Loading this directory (as if it was as script, see above)
/ loads these tables into the q session
\l .
/ the serialized table will be loaded into memory
/ however the splayed table will only be mapped, not loaded
/ both tables can be queried using q-sql
select from serialized
/ => a b
/ => ---
/ => 1 1
/ => 2 2
/ => 3 3
select from splayed / (the columns are read from disk on request)
/ => a b
/ => ---
/ => 1 1
/ => 2 2
/ => 3 3
/ see http://code.kx.com/q4m3/14_Introduction_to_Kdb+/ for more

////// Frameworks //////
/ kdb+ is typically used for data capture and analysis.
/ This involves using an architecture with multiple processes
/ working together. kdb+ frameworks are available to streamline the setup
/ and configuration of this architecture and add additional functionality
/ such as disaster recovery, logging, access, load balancing etc.
/ https://github.com/DataIntellectTech/TorQ

Want to know more?