Slow vertex arrays?
category: general [glöplog]
Hello, I am just trying out other ways to draw polygons in OpenGL than the first one I learned from NeHe. I have tried vertex arrays (not with indices yet) and, surprisingly, they're slower than doing the same thing with plain glVertex commands. Why is this happening? Am I doing something wrong, or is it normal?
The code:
const int num_scape_vertices = scape_width * scape_height * 4; // 4 unshared vertices per quad
GLfloat scape_vertices[num_scape_vertices * 3];                // x, y, z per vertex

// Fill the array with one quad per grid cell on the y = 0 plane.
void InitPlaneVarrays1()
{
    int n = 0;
    for (float z = -scape_height/2; z<scape_height/2; z++)
    {
        for (float x = -scape_width/2; x<scape_width/2; x++)
        {
            // (i, j) walks the four corners; fabs(i-j) yields x offsets 0, 1, 1, 0
            for (float j = 0.0f; j<=1.0f; j++)
            {
                for (float i = 0.0f; i<=1.0f; i++)
                {
                    scape_vertices[n++] = x+fabs(i-j);
                    scape_vertices[n++] = 0.0f;
                    scape_vertices[n++] = z+j;
                }
            }
        }
    }
}

// Draw straight from application memory (client-side vertex array, no indices).
void DrawPlaneVarrays1()
{
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, scape_vertices);
    glDrawArrays(GL_QUADS, 0, num_scape_vertices);
    glDisableClientState(GL_VERTEX_ARRAY);
}

void DrawStuff()
{
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    glDisable(GL_DEPTH_TEST);
    glDisable(GL_TEXTURE_2D);
    glDisable(GL_BLEND);
    glLoadIdentity();
    glTranslatef(0.0f, -32.0f, -256.0f);
    glColor4f(0.5f, 0.5f, 0.5f, 1.0f);
    DrawPlaneVarrays1();
}
The simple glVertex method (which sends exactly the same number of vertices, i.e. the shared vertices are sent two or more times since there are no indices) was making 68fps, this vertex array method is making 39fps, and the display list 462fps (I am planning to use this and then manipulate the vertex positions in the vertex shader). I know that if I use vertex arrays with indices I might get 4 times the speed and beat the simple method, but why does the plain vertex array method, with the same number of vertices, lose in speed even to simple rows of glVertex3f commands? I have an ATI HD 3650 here.
Are VBOs (vertex buffer objects) the next thing to try? Do they give a significant gain over any of the other methods? Are they widely compatible, or should I rather stick to display lists? Would display lists be the best choice if I want to send a static plane of vertices, like here, that I will manipulate later in the vertex shader?
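For reference, here is a rough sketch of what the indexed variant of the plane could look like. It only shows how much vertex data sharing corners saves, and says nothing about the resulting frame rate. GridMesh, buildIndexedGrid and unindexedVertexCount are made-up names for illustration and are not part of the code above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type for illustration; not from the original post.
struct GridMesh {
    std::vector<float> positions;       // x, y, z per unique vertex
    std::vector<unsigned int> indices;  // 4 indices per quad
};

// Builds a flat width x height grid of quads on the y = 0 plane,
// sharing corner vertices between neighbouring quads.
GridMesh buildIndexedGrid(int width, int height) {
    assert(width > 0 && height > 0);
    GridMesh mesh;
    const unsigned int cols = static_cast<unsigned int>(width) + 1u;   // vertices per row
    const unsigned int rows = static_cast<unsigned int>(height) + 1u;  // number of vertex rows
    mesh.positions.reserve(static_cast<std::size_t>(cols) * rows * 3);
    for (unsigned int z = 0; z < rows; ++z) {
        for (unsigned int x = 0; x < cols; ++x) {
            mesh.positions.push_back(static_cast<float>(static_cast<int>(x) - width / 2));
            mesh.positions.push_back(0.0f);
            mesh.positions.push_back(static_cast<float>(static_cast<int>(z) - height / 2));
        }
    }
    mesh.indices.reserve(static_cast<std::size_t>(width) * height * 4);
    for (unsigned int z = 0; z + 1 < rows; ++z) {
        for (unsigned int x = 0; x + 1 < cols; ++x) {
            const unsigned int i0 = z * cols + x;
            // Same corner order as the original loop:
            // (x, z), (x+1, z), (x+1, z+1), (x, z+1)
            mesh.indices.push_back(i0);
            mesh.indices.push_back(i0 + 1);
            mesh.indices.push_back(i0 + cols + 1);
            mesh.indices.push_back(i0 + cols);
        }
    }
    return mesh;
}

// Number of vertices the non-indexed version submits for the same grid.
std::size_t unindexedVertexCount(int width, int height) {
    return static_cast<std::size_t>(width) * height * 4;
}
```

For a hypothetical 64x64 grid this gives 4225 unique vertices instead of 16384, which is where the "about 4 times" figure comes from. The index array would then go to glDrawElements(GL_QUADS, ...) with GL_UNSIGNED_INT indices instead of glDrawArrays.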
Code:
glPushClientAttrib(GL_CLIENT_VERTEX_ARRAY_BIT);
...
glPopClientAttrib();
no idea if this will help..
VBOs are significantly faster, yes. It depends a bit on the drivers, but there's no reason not to use VBOs nowadays.
for this kind of stuff, isn't dx9 better?
use VBOs.
and no, dx9 isn't faster, because you're forced to use interleaved data as far as i understood. it means that if part of your vertex data is constant and the other part dynamic, you can't update only the dynamic part.
on the other hand the opengl pointer functions let you use whatever format best suits your use.
nystep, you can define several parallel streams and then keep some of them static and the others dynamic without any problem in dx9. You can't do that within _one_ vertex buffer but you can combine several vertex buffers for drawing.
What preacher said.
Replace the pointer with 0, and create a buffer object, bind it... It's just one or two extra function calls.
Quote:
nystep, you can define several parallel streams and then keep some of them static and the others dynamic without any problem in dx9. You can't do that within _one_ vertex buffer but you can combine several vertex buffers for drawing.
you can also do it in one vertex buffer, you just need to specify the correct offset on SetStreamSource. the one thing you can do in GL that you can't normally do in D3D9 is to have vertex and index data in the same buffer, but that's not really practically relevant one way or the other.
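To make the offset idea concrete, here is a small sketch of how two non-interleaved streams could be laid out back to back in a single buffer. The struct and function names are invented for illustration, and the position/colour formats are just an assumed example. The resulting byte offsets and strides are the kind of values one would hand to SetStreamSource (OffsetInBytes, Stride) in D3D9, or to glVertexPointer/glColorPointer as byte offsets when a VBO is bound.

```cpp
#include <cassert>
#include <cstddef>

static_assert(sizeof(float) == 4, "sketch assumes 32-bit floats");

// Hypothetical description of where one stream lives inside a shared buffer.
struct StreamRange {
    std::size_t offsetBytes;  // where the stream starts in the buffer
    std::size_t strideBytes;  // distance between consecutive elements
};

struct PackedLayout {
    StreamRange positions;  // static part: 3 floats per vertex
    StreamRange colors;     // dynamic part: 4 bytes (RGBA) per vertex
    std::size_t totalBytes;
};

// Places all positions first, then all colours, so the colour block can be
// rewritten on its own without touching the positions.
PackedLayout packTwoStreams(std::size_t vertexCount) {
    assert(vertexCount > 0);
    const std::size_t posStride = 3 * sizeof(float);
    const std::size_t colStride = 4 * sizeof(unsigned char);
    PackedLayout layout;
    layout.positions = {0, posStride};
    layout.colors = {vertexCount * posStride, colStride};
    layout.totalBytes = vertexCount * (posStride + colStride);
    return layout;
}
```

With 1000 vertices the colour stream starts at byte 12000 and the buffer is 16000 bytes in total, so only the last 4000 bytes need updating when the colours change.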
also, nystep, you've proven time and again that you don't have the slightest clue about d3d. it's beyond time to either get your facts straight (the d3d docs aren't exactly hard to come by) or, if that's too much work for you, to just shut up.
wasn't there a point in time where gpus had only 1 hw accelerated stream? which would make using parallel streams slow?
that sounds weird. i mean i can see how hardware would be capable of processing only a single (interleaved) data stream at once, but that's some old hw you're talking about i suppose? :)
yeah well my memory is blurry about that :/
all current hw supports multiple streams.
and, just in case, the same was true back in 2001.
seriously, everyone, stop it with the FUD already.
:-)
I think someone needs to take a beer and relax.
ogl vs d3d is irrelevant here. VBOs are nice to use so why not use them?
it's not like anybody here makes money out of optimizing crap anyways. even if some people seem to think they are.
Quote:
ogl vs d3d is irrelevant here
it is *always* an irrelevant matter, at least in the way it's often discussed on this forum -- therefore it irritates me that some want to grab even the most minor of opportunities to fuel that fire time and time again.
this sounds like a very, very crappy ati driver. i can not imagine what code would make lots of glVertex calls slower than arrays. i can also hardly imagine what driver code would make lists 10x faster. imho the difference between vertices from vram or ram should be at max 4x. getting even a 2x performance difference on vertices is really unlikely.
i really suspect that the measuring is totally wrong.
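One thing that helps when checking numbers like these is to compare frame times instead of fps, and to average over many frames. Below is a minimal sketch; FrameStats is a made-up helper, and where the per-frame durations come from (some timer around the draw plus swap) is left as an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Converts a frame rate into the time spent on one frame, in milliseconds.
double frameTimeMs(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Hypothetical helper: averages over many frames instead of trusting one sample.
class FrameStats {
public:
    // Record how long one frame took, in seconds.
    void addFrame(double seconds) {
        totalSeconds_ += seconds;
        ++frames_;
    }

    double averageFps() const {
        return totalSeconds_ > 0.0
                   ? static_cast<double>(frames_) / totalSeconds_
                   : 0.0;
    }

    double averageFrameMs() const {
        return frames_ > 0
                   ? 1000.0 * totalSeconds_ / static_cast<double>(frames_)
                   : 0.0;
    }

private:
    double totalSeconds_ = 0.0;
    std::size_t frames_ = 0;
};
```

Plugging in the figures from the first post, 68fps, 39fps and 462fps come out at roughly 14.7 ms, 25.6 ms and 2.2 ms per frame.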
I've had experiences with those crappy ATI drivers back in the day before VBOs. I had my data in an array and the frame rate difference between glVertex3f and glVertexPointer/glDrawArrays was negligible.
no actually the measurements seem right to me. if your application is only sending glVertex commands it is actually quite fast (though not very useful, and it should be avoided if you have a whole game with AI/physics and stuff running as well). if you use static VBOs you should match the speed of your display lists. just keep in mind that display lists are slightly outdated and deprecated, and sometimes even unstable, Optimus.
But yep, when the CPU only has that to do, sending glVertex commands is faster than vertex arrays. in fact if you think about it, it seems quite logical: the driver needs to copy your application memory block to a zone where it can start a DMA transfer or something to the gfx card. if you use glVertex the driver directly stores things in that DMA-friendly memory zone, so it avoids a copy.
And optimus, keep in mind not to use direct3d. ;) it could make you lose your sense of humor obviously. :)
I can imagine a hundred reasons for glVertex being fast(er) in these kinds of minor situations (having the driver directly translating the data into a command buffer) -- does this operation indeed set up a DMA transfer every frame? Because then it's even quite obvious why it'd be slower than just force-feeding the (minor amount of) data into the command buffer.
(no OGL expertise here, so please do tell..)
well duh... glVertex passes in the stuff through function calls! the memory gets copied at least from the stack. why doing a lot of tiny copies from the stack vs one big copy from the heap should be faster does not make sense at all.
shiva: how exactly do those vertex/index pointers (in their simplest form, as described by Optimus' example) work? it can't be assumed that the data is static for any duration other than the call's, so i guess it has to set up a copy right away. from there, from a hardware perspective afaik, it either maps some memory so the gpu can draw from cpu ram, or dmas it to an intermediate space in vram the driver keeps for these cases. or is that too simple?
because in both those cases i can still see some overhead killing any gain over just pushing 'immediate' vertices into the command stream 1-by-1. in small cases like this, of course.
shiva, one could argue that it's the _same_ stack address every time. If you look at the loop above and imagine doing exactly that via glVertex(), it can indeed be faster than copying the array because nothing is ever read from main memory - the data gets generated, parked in the L1 cache, and then written out, most probably to write-combined memory. Sounds faster to me than memcpy.
As data is often specified on a glVertex-level, quite some amount of optimization and foresight went into the driver at that point (eg building indexbuffers and reordering for cache-friendliness) while vertexarrays are copied as-is.
Ok, so i did a simple starfield effect using vertex arrays in order to show you how much they suck ;)
Code:
#define N_STARS 8000

// vec3d is assumed to be three tightly packed floats (x, y, z),
// which is what glVertexPointer(3, GL_FLOAT, 0, ...) expects.
static vec3d *stars;
static float speed;

// Cheap integer hash mapped to [0, 1).
static float random1d( int x )
{
    int n = (x * (x * x * 75731 + 189221) + 1371312589);
    return (float) (n & 0x7FFFFFFF) * (1.f / 2147483648.f);
}

void initStarfield()
{
    stars = new vec3d [N_STARS];
    for (int i=0; i<N_STARS; i++)
    {
        stars[i].x = (random1d( i*4 ) - 0.5f) * 200;
        stars[i].y = (random1d( i*4+1 ) - 0.5f) * 200;
        stars[i].z = (random1d( i*4+2 ) - 0.5f) * 200;
    }
    // The pointer is specified once, here, and never again.
    glVertexPointer( 3, GL_FLOAT, 0, stars );
    glEnableClientState( GL_VERTEX_ARRAY );
}

void setStarfieldScrollSpeed( float s )
{
    speed = s;
}

// Moves the stars in place in application memory; wraps z back by 200 past 100.
void updateStarfield( float dt )
{
    const float translate = dt * speed;
    for (int i=0; i<N_STARS; i++)
    {
        stars[i].z += translate;
        stars[i].z = stars[i].z > 100.f ? stars[i].z - 200.f : stars[i].z;
    }
}

void renderStarfield()
{
    glDrawArrays( GL_POINTS, 0, N_STARS );
}
So, for starters, as you see, glVertexPointer is called only in the init, and in this case that is correct since our sample only does this and only uses one vertex array...
This underlines something. Since this code actually works and is animated, it means that the data is copied on the batch call: glDrawArrays. And indeed, if you look closely, the pointer command defines the number of components per element, the stride to the next element, and a pointer. So your driver has absolutely no idea what the size of the data block is when you call glVertexPointer.
On the other hand, when you call one of the batch calls, DrawArrays or DrawElements, your driver can know the range of indices it needs to copy.
And it absolutely cannot assume the data was not changed between 2 batch calls, since it lives in the application's memory, and that's why the effect above actually works.
So yes, vertex arrays really suck and you should give VBOs a try ;)
So as a result, if you do a DrawElements with indices 1, 3, 5, .. n+1 and then another DrawElements with indices 0, 2, 4, .. n, your driver will have to copy the data twice even if it did not change between the calls. I think this is what shiva misunderstood, as he seems to think the copy is done at array setup..
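The argument can be illustrated with a toy model. This is not how any real driver is written and ToyDriver is not a GL API; every name below is made up, and a real implementation may well be smarter. It only counts how many bytes get copied when the data has to be snapshotted at each draw call, compared with uploading it once into driver-owned storage as a static VBO would allow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Toy model only: ToyDriver and its methods are hypothetical, not a GL API.
class ToyDriver {
public:
    // Plays the role of glVertexPointer: remembers where the data starts and
    // how many floats make up a vertex, but learns nothing about its size.
    void setClientPointer(const float* ptr, std::size_t floatsPerVertex) {
        clientPtr_ = ptr;
        floatsPerVertex_ = floatsPerVertex;
    }

    // Plays the role of glDrawArrays on client memory: only now is the range
    // known, and the data may have changed, so it is snapshotted on every call.
    void drawClientArray(std::size_t first, std::size_t count) {
        const std::size_t floats = count * floatsPerVertex_;
        if (clientPtr_ == nullptr || floats == 0) return;
        staging_.resize(floats);
        std::memcpy(staging_.data(), clientPtr_ + first * floatsPerVertex_,
                    floats * sizeof(float));
        bytesCopied_ += floats * sizeof(float);
    }

    // Plays the role of a one-time upload into driver-owned storage.
    void uploadBuffer(const float* ptr, std::size_t floatCount) {
        assert(ptr != nullptr || floatCount == 0);
        buffer_.assign(ptr, ptr + floatCount);
        bytesCopied_ += floatCount * sizeof(float);
    }

    // Drawing from the driver-owned buffer needs no further copy in this model.
    void drawBuffer(std::size_t /*first*/, std::size_t /*count*/) { ++bufferDraws_; }

    std::size_t bytesCopied() const { return bytesCopied_; }
    std::size_t bufferDraws() const { return bufferDraws_; }
    const std::vector<float>& lastSnapshot() const { return staging_; }

private:
    const float* clientPtr_ = nullptr;
    std::size_t floatsPerVertex_ = 0;
    std::vector<float> staging_;   // stand-in for the DMA-able staging area
    std::vector<float> buffer_;    // stand-in for a static buffer object
    std::size_t bytesCopied_ = 0;
    std::size_t bufferDraws_ = 0;
};
```

In this model, drawing the same 100 vertices twice from client memory copies the data twice, and a change made between the two draws shows up in the second snapshot, which is the starfield behaviour. Uploading once and drawing ten times copies the data only once.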